[squeak-dev] Inconsistent growing behavior of Streams and Collections
leves
leves at caesar.elte.hu
Tue Feb 28 18:19:02 UTC 2023
Hi Marcel,
As long as growth is exponential, the time complexity of adding a new
element to the collection is amortized O(1) with a constant depending on
the base.
On 2023. 02. 27. 11:48, Marcel Taeumel via Squeak-dev wrote:
> Hi all --
>
> What is the best growing strategy we currently have and can we use it
> consistently in the image?
Everything looks consistent to me except for WriteStream. But it may as
well stay unique (see below). Each of those classes solve different
problems.
>
> 1) HashedCollection (i.e., Dictionary etc.)
> - growing is rather clever via prime table and max. 75% load factor
> - see #growSize
Base 2+ exponential growth.
>
> 2) OrderedCollection
> - grow if less than 50% free space (see #makeRoomAt*)
That's a special case to avoid repeatedly moving elements from one end
to the other in the internal array. The idea is to trigger growth ahead
of time to avoid quadratic runtime when elements are added to both ends
of the collection.
> - grow by 100% (see #growAt*)
Exponential growth with base 2.
>
> 3) WriteStream (e.g., String streamContents: [...])
> - !! grow by 100% (at most 1 Million !) via #grownBy: for
> #nextPut:/#pastEndPut: (and #nextPutAll: for Strings)
> - grow by min+25% via #growTo: for #nextPutAll: (for non-strings)
This is where your problem is.
#pastEndPut: has a one million limit on the maximum growth. That may
have been useful back in the early 2000s when it was introduced, but
nowadays it's pretty much a hurdle. Let's remove that. Once done, the
growth will be the common exponential with base 2.
#growTo: is really weird.
Senders assume that it grows to the size passed to it, so they add 10 to
the expected new end to have some extra slots, but it actually adds an
extra 25% of the old size to that or 20 slots, whichever is larger above
the requested amount.
The extra 25% provides exponential growth with base 1.25, which is very
conservative.
If you want unified behavior, increase the growth to 100%. I'd also get
rid of those extra +10s.
But, there's a catch with using base 2. When you get near the limits of
your machine your code may fail sooner than expected.
E.g. if you want to build a 6GB string on your machine with 16GB RAM, it
may just run out of memory.
If the buffer grows close to 6GB, it'll try grow to 12GB, but you do not
have enough RAM for that.
Had the value been lower, e.g. 1.5, the final buffer size would have
been 9GB, so your code would have had a chance to succeed.
Perhaps, the best way to tackle that is the following strategy: if
allocating the new array fails, request one with an exponentially
smaller increment up until it succeeds. It may never succeed, but then
you know you need a better machine.
>
> ---
>
> Well, using "String streamContents: [...]" is *really* slow for very
> large strings (e.g., > 100 MB) bc. it triggers a lot of full GCs bc.
> #grownBy: only adds a 1 megabyte at most per grow. Therefore, our JSON
> support suffers a serious performance impact when using #asJsonString on
> larger structures.
Just remove the one million limit from #pastEndPut:.
Levente
More information about the Squeak-dev
mailing list
|