[squeak-dev] Inconsistent growing behavior of Streams and Collections

leves leves at caesar.elte.hu
Tue Feb 28 18:19:02 UTC 2023


Hi Marcel,

As long as growth is exponential, the time complexity of adding a new 
element to the collection is amortized O(1) with a constant depending on 
the base.
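To make the amortized-O(1) claim concrete, here is a small sketch (Python, not Squeak code) that counts how many element copies a growable buffer performs under two policies. With exponential growth the total number of copies stays proportional to n, so each append is amortized O(1); with a fixed increment the total is quadratic:

```python
# Sketch (not Squeak code): count element copies performed while
# appending n elements, under a given capacity-growth policy.
def total_copies(n, grow):
    capacity, size, copies = 1, 0, 0
    for _ in range(n):
        if size == capacity:
            copies += size          # growing copies every existing element
            capacity = grow(capacity)
        size += 1
    return copies

doubling = total_copies(100_000, lambda c: c * 2)     # exponential, base 2
fixed    = total_copies(100_000, lambda c: c + 1024)  # fixed increment

assert doubling < 2 * 100_000      # O(n) total copies => amortized O(1)
assert fixed > 40 * 100_000        # quadratic blow-up
```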

On 2023. 02. 27. 11:48, Marcel Taeumel via Squeak-dev wrote:
> Hi all --
> 
> What is the best growing strategy we currently have and can we use it 
> consistently in the image?

Everything looks consistent to me except for WriteStream. But it may as 
well stay unique (see below). Each of those classes solves a different 
problem.

> 
> 1) HashedCollection (i.e., Dictionary etc.)
> - growing is rather clever via prime table and max. 75% load factor
> - see #growSize

Base 2+ exponential growth.
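The idea can be sketched like this (Python, an illustration of the scheme described above, not Squeak's actual #growSize or prime table): grow the capacity to a prime at least twice the old one, and trigger growth once the load factor would exceed 75%:

```python
# Sketch of the scheme, not Squeak's implementation: prime capacities
# with base-2+ exponential growth and a 75% maximum load factor.
def is_prime(n):
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))

def next_capacity(capacity):
    # Smallest prime at least twice the old capacity (base 2+ growth).
    candidate = capacity * 2 + 1
    while not is_prime(candidate):
        candidate += 2
    return candidate

def needs_growth(size, capacity):
    # Grow once more than 75% of the slots are occupied.
    return size * 4 > capacity * 3

next_capacity(31)   # → 67, the first prime past 62
```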

> 
> 2) OrderedCollection
> - grow if less than 50% free space (see #makeRoomAt*)

That's a special case to avoid repeatedly moving elements from one end 
to the other in the internal array. The idea is to trigger growth ahead 
of time to avoid quadratic runtime when elements are added to both ends 
of the collection.

> - grow by 100% (see #growAt*)

Exponential growth with base 2.
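The two rules above can be sketched together (Python, a simplified illustration, not the actual OrderedCollection code): keep the elements in the middle of an internal array, re-center when one end runs out of room, and double the array only when less than 50% of it is free. Without the 50% rule, alternating addFirst/addLast would shift all elements on nearly every add, giving quadratic total runtime:

```python
# Sketch (not OrderedCollection itself): a two-ended buffer with
# re-centering and 100% growth when less than half the array is free.
class TwoEndedBuffer:
    def __init__(self):
        self._array = [None] * 8
        self._first = 4           # index of the first element
        self._last = 4            # one past the last element

    def _make_room(self):
        size = self._last - self._first
        capacity = len(self._array)
        if capacity - size < capacity // 2:
            capacity *= 2         # grow by 100% only when mostly full
        new = [None] * capacity
        start = (capacity - size) // 2      # re-center the elements
        new[start:start + size] = self._array[self._first:self._last]
        self._array, self._first, self._last = new, start, start + size

    def add_first(self, x):
        if self._first == 0:
            self._make_room()
        self._first -= 1
        self._array[self._first] = x

    def add_last(self, x):
        if self._last == len(self._array):
            self._make_room()
        self._array[self._last] = x
        self._last += 1

    def items(self):
        return self._array[self._first:self._last]
```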

> 
> 3) WriteStream (e.g., String streamContents: [...])
> - !! grow by 100% (at most 1 Million !) via #grownBy: for 
> #nextPut:/#pastEndPut: (and #nextPutAll: for Strings)
> - grow by min+25% via #growTo: for #nextPutAll: (for non-strings)

This is where your problem is.

#pastEndPut: has a one million limit on the maximum growth. That may 
have been useful back in the early 2000s when it was introduced, but 
nowadays it's pretty much a hurdle. Let's remove that. Once done, the 
growth will be the common exponential with base 2.

#growTo: is really weird.
Senders assume that it grows the collection to the size passed in, so 
they add 10 to the expected new end to get some extra slots. But it 
actually adds, on top of the requested amount, an extra 25% of the old 
size or 20 slots, whichever is larger.

The extra 25% provides exponential growth with base 1.25, which is very 
conservative.
If you want unified behavior, increase the growth to 100%. I'd also get 
rid of those extra +10s.
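The sizing rule described above can be modeled like this (Python, derived from the description, not the Squeak source). Growing a full buffer repeatedly through it multiplies the size by roughly 1.25 per step, which is where the base-1.25 exponential comes from:

```python
# Sketch of the #growTo: sizing rule as described above (not the
# Squeak source): requested size plus max(25% of old size, 20 slots).
def grow_to_size(old_size, requested):
    return requested + max(old_size // 4, 20)

# Repeatedly growing a full buffer by one element shows the effective
# exponential growth with base ~1.25.
size = 1000
for _ in range(10):
    size = grow_to_size(size, size + 1)
# size is now 9336, close to 1000 * 1.25**10 ≈ 9313
```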

But, there's a catch with using base 2. When you get near the limits of 
your machine your code may fail sooner than expected.
E.g. if you want to build a 6GB string on your machine with 16GB RAM, it 
may just run out of memory.
If the buffer grows close to 6GB, it'll try to grow to 12GB, but you do 
not have enough RAM for that.
Had the value been lower, e.g. 1.5, the final buffer size would have 
been 9GB, so your code would have had a chance to succeed.

Perhaps the best way to tackle that is the following strategy: if 
allocating the new array fails, retry with an exponentially smaller 
increment until it succeeds. It may never succeed, but then you know 
you need a better machine.
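That fallback strategy might look like this (Python, an assumed sketch, not existing Squeak code; the `allocate` callback and the toy budget-limited allocator are hypothetical stand-ins for the real allocation path):

```python
# Sketch (assumed strategy, not existing Squeak code): on allocation
# failure, halve the increment and retry until the request fits or
# the increment cannot shrink any further.
def grow_buffer(old_size, allocate):
    increment = old_size            # first attempt: doubling (old + old)
    while increment >= 1:
        try:
            return allocate(old_size + increment)
        except MemoryError:
            increment //= 2         # exponentially smaller increment
    raise MemoryError('cannot grow even by a single slot')

# Hypothetical allocator with a fixed budget, to exercise the fallback.
def make_allocator(budget):
    def allocate(n):
        if n > budget:
            raise MemoryError
        return n                    # stand-in for the new array's size
    return allocate

grow_buffer(6, make_allocator(9))   # doubling to 12 fails; 9 succeeds
```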

> 
> ---
> 
> Well, using "String streamContents: [...]" is *really* slow for very 
> large strings (e.g., > 100 MB) bc. it triggers a lot of full GCs bc. 
> #grownBy: only adds a 1 megabyte at most per grow. Therefore, our JSON 
> support suffers a serious performance impact when using #asJsonString on 
> larger structures.

Just remove the one million limit from #pastEndPut:.


Levente
