[Vm-dev] rough idea for per segment Lilliputian lists

Clément Béra bera.clement at gmail.com
Tue Oct 9 07:52:40 UTC 2018


Hi Ben,

What you are saying makes a lot of sense.

Some context:

Free chunks are organized, as my last blog post describes, in a 64-word
array structure. Words 2 to 63 are either 0 or a reference to the first
free chunk of size 2 to 63 respectively. Word 0 is a reference to the free
tree, a binary tree whose nodes each reference the first free chunk of a
given size > 63. Large chunks are reached through the free tree, small
chunks directly through the 64-word array.
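
To make that concrete, here is a rough C sketch of the structure
(hypothetical names and types, not the actual Spur/Slang code):

/* Sketch only: illustrative of the free-chunk directory described above. */
typedef struct FreeChunk FreeChunk;
struct FreeChunk {
    unsigned long header;      /* encodes "free chunk" and its size in words */
    FreeChunk    *next;        /* next free chunk of the same size class     */
    /* chunks larger than 2 words also have room for more link fields       */
};

typedef struct FreeTreeNode FreeTreeNode;
struct FreeTreeNode {
    FreeChunk    *chunks;      /* first free chunk of this exact size (> 63) */
    FreeTreeNode *smaller;     /* subtree of smaller sizes                   */
    FreeTreeNode *larger;      /* subtree of larger sizes                    */
};

/* The 64-word structure:
     word 0       -> root of the free tree (sizes > 63 words)
     words 2..63  -> first free chunk of exactly that many words, or 0      */
typedef union { FreeTreeNode *treeRoot; FreeChunk *listHead; } FreeListEntry;
FreeListEntry freeLists[64];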

When I started to work on SelectiveCompactor, all the free chunks were
organized from there as singly linked lists. That worked well: to allocate
a chunk, one can just take the first one of the list, and in the
PlanningCompact algorithm, compaction resets the free lists and just adds
back a few large free chunks at the end of the compaction. Then, when I
built the sweep algorithm, I first tried to remove free chunks from the
free lists while iterating over memory to concatenate unmarked objects and
existing free chunks. Obviously, at 2Gb the sweep was already horribly
slow, since removing an element from a singly linked list in order to
concatenate it is O(N). My first idea was to organize the free chunks as
doubly linked lists instead of singly linked lists, to remove them from
their list in O(1) instead of O(N). That introduced the concept of
lilliputian chunks, present only in 64 bits: free chunks of size 2 can hold
only the header and one pointer, hence they are the only ones organized as
a singly linked list while the others are organized as doubly linked lists.
Removing a chunk from its list is now O(1), except for lilliputians.
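
The cost difference looks roughly like this in C (a sketch under the
description above, with hypothetical helper names):

/* Sketch: unlinking a free chunk from its size class.
   Chunks of size > 2 words carry next and prev pointers, so unlinking
   is O(1); lilliputian chunks (size 2) only have room for next, so
   unlinking means walking the singly linked list: O(N). */
typedef struct Chunk Chunk;
struct Chunk {
    unsigned long header;
    Chunk *next;
    Chunk *prev;                      /* only present for sizes > 2 */
};
static Chunk *listHead[64];           /* per-size free list heads   */

void unlinkDoublyLinkedChunk(Chunk *c, int sizeClass) {    /* O(1) */
    if (c->prev) c->prev->next = c->next;
    else         listHead[sizeClass] = c->next;
    if (c->next) c->next->prev = c->prev;
}

void unlinkLilliputianChunk(Chunk *c) {                    /* O(N) */
    Chunk **link = &listHead[2];
    while (*link != c)                /* assumes c really is in the list */
        link = &(*link)->next;
    *link = c->next;
}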

However, that was still too slow for the sweep algorithm (removing a chunk
still requires 2 extra random memory writes, which hurts the CPU cache, and
the long lilliputian linked list was hurting performance). The
implementation that is there now just resets the free chunk linked lists
before sweeping and rebuilds them while sweeping.
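
In rough C terms the current sweep does something like this (a hedged
sketch with assumed helpers, not the actual Slang method):

/* Sketch of the "reset and rebuild" sweep described above. */
extern char  *heapStart, *heapEnd;           /* assumed heap bounds          */
extern int    isMarkedAt(char *p);           /* assumed helpers, hypothetical */
extern void   unmarkAt(char *p);             /* names, not real VMMaker ones  */
extern size_t entityBytesAt(char *p);        /* size of object or free chunk */
extern void   resetFreeLists(void);
extern void   addFreeChunk(char *p, size_t bytes);

void sweepAndRebuildFreeLists(void) {
    resetFreeLists();                        /* empty the 64-entry directory and tree */
    char *p = heapStart;
    while (p < heapEnd) {
        if (isMarkedAt(p)) {
            unmarkAt(p);
            p += entityBytesAt(p);
        } else {
            /* dead object or pre-existing free chunk: coalesce the whole run */
            char *runStart = p;
            while (p < heapEnd && !isMarkedAt(p))
                p += entityBytesAt(p);
            addFreeChunk(runStart, (size_t)(p - runStart));  /* relinks into the lists */
        }
    }
}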

In SelectiveCompactor, the sweep phase sweeps the heap, rebuilding the
linked lists, but also computes the occupation of each segment. If at least
1 segment is less than 70% full, a selective compaction phase starts,
selecting the maximum number of segments, starting from the least occupied
ones, and compacting them into an entirely empty segment. Typically it
compacts 1-10 segments into 1. In this context, the compacted segments hold
free chunks which need to be cleared from the linked lists. Clearing free
chunks organized as doubly linked lists is fast enough, since only a few
segments are compacted. Clearing lilliputians is still O(N), so clearing
many of them leads to horrible compaction times (O(N^2): multiple seconds
instead of multiple ms on a 20Gb workload). So I needed to find a solution
for lilliputians.
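
A rough sketch of the selection step (hypothetical names, the 70% threshold
taken from the description above):

/* Sketch: compact the least occupied segments (< 70% full) into one
   entirely empty segment. Illustrative only. */
typedef struct {
    char  *start;
    size_t size;
    size_t liveBytes;                        /* measured during the sweep */
} Segment;

extern void sortByLiveBytesAscending(Segment *segs, int count);
extern void compactSegmentInto(Segment *from, Segment *to);  /* copy live
                                        objects, leave forwarders behind */

void selectiveCompact(Segment *segs, int count, Segment *segmentToFill) {
    sortByLiveBytesAscending(segs, count);             /* least occupied first */
    size_t room = segmentToFill->size;
    for (int i = 0; i < count; i++) {
        Segment *s = &segs[i];
        if (s == segmentToFill) continue;
        if (s->liveBytes * 100 >= s->size * 70) break;  /* >= 70% full: stop   */
        if (s->liveBytes > room) break;                 /* target segment full */
        compactSegmentInto(s, segmentToFill);
        room -= s->liveBytes;
    }
}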

Solution:

As Ben suggested, one solution would be to organize the lilliputian singly
linked list as a per-segment linked list. I did not do that because it
would impact the free chunk organization a lot; I changed it in the past
and it takes 2 days to make sure I did not break anything in the production
VM.

So what I did instead, first, was to sort the lilliputians while sweeping
(which is almost free: I just need an extra variable and execution time is
not really impacted), and then iterate over the lilliputian linked list
while compacting segments. Since the segments are compacted in ascending
address order, that means I iterate at most once over the lilliputian
linked list during the whole compaction (O(N^2) becomes O(N)). That reduced
the stupid pathological cases from multiple seconds to multiple hundreds of
ms.
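
The sorting trick works because the sweep scans memory in ascending address
order: appending each lilliputian after the previously added one (instead
of pushing it on the head of the list) leaves the size-2 list address-sorted
essentially for free. A sketch with hypothetical names:

/* Sketch of sorting the lilliputian list while sweeping. */
typedef struct Chunk Chunk;
struct Chunk { unsigned long header; Chunk *next; };

static Chunk *lilliputianHead = NULL;
static Chunk *lastLilliputianChunk = NULL;   /* the single extra variable */

void addLilliputianDuringSweep(Chunk *c) {
    c->next = NULL;
    if (lastLilliputianChunk == NULL)
        lilliputianHead = c;                 /* first lilliputian of the sweep */
    else
        lastLilliputianChunk->next = c;      /* append at the tail, keeping    */
    lastLilliputianChunk = c;                /* ascending address order        */
}

During compaction a single cursor then advances through that list once,
across all the segments being compacted.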

That was great, but still not enough (we target 16ms pauses for 60fps
apps). So in addition I annotate each segment, while sweeping, with its
last lilliputian, which means that I can now iterate over the lilliputian
linked list per segment and remove the remaining overhead, leading to
compaction pauses of a few ms.
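
With each segment also remembering its last lilliputian, the walk can be
bounded per segment, roughly like this (a sketch of the idea, not the
actual incrementalUnlinkSmallChunk: implementation):

/* Sketch: detach all lilliputians belonging to one compacted segment,
   using the per-segment "last lilliputian" annotation from the sweep. */
typedef struct Chunk Chunk;
struct Chunk { unsigned long header; Chunk *next; };

typedef struct {
    char  *start, *end;
    Chunk *lastLilliputian;        /* set during the sweep, NULL if none */
} SegmentInfo;

extern void markChunkAsUnavailable(Chunk *c);   /* assumed helper */

/* link points at the list cell holding the first lilliputian at or after
   seg->start; it is maintained across segments in ascending address order. */
void unlinkLilliputiansOfSegment(SegmentInfo *seg, Chunk **link) {
    while (*link != NULL && (char *)*link < seg->end) {
        Chunk *c = *link;
        *link = c->next;                         /* detach from the sorted list */
        markChunkAsUnavailable(c);
        if (c == seg->lastLilliputian) break;    /* nothing more in this segment */
    }
}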

I still clear the lilliputians while iterating over the segment in the
compaction phase. Clearing all the lilliputians of 1 segment at once is
indeed possible now, but it has no real performance benefit IMO since it
does not avoid any cache miss (the previous lilliputian is in the cache
from the segment iteration and the current one has just been read from
memory). The current implementation works, so I've just kept it that way.

Now, what you suggest, Ben, is better: no sorting of lilliputians while
sweeping, just keep a reference to the last lilliputian in each segment
(the first one found in the segment, since they end up in reverse address
order), and then clear all of a segment's lilliputians at once when
compacting it. It was just easier to make it work the way I did, given what
was already implemented at the time.

I thought about doing something similar to what you describe when I
finished, but I don't think it would improve performance that much (it does
not remove the reads/writes with cache misses), so I was too lazy.

Right now I have 2 development images, one matching the trunk and one with
extra profiling changes for my paper, so it is difficult to change things.
But I may do that later when I finish this paper or you can change it
yourself.

Compiling with selective requires changing compactorClass and enabling
tempvectorreadbarrier in the Slang compilation settings. The only thing is
that you cannot really snapshot right now with selective (the idea is to
have both selective and planning, so that snapshots use planning and the
rest of the code uses selective, see hybridcompactor; not sure that works
so far though, because without full heap compaction snapshots are quite a
bit bigger).

Damn, that was a long mail. But you really nailed the problem. I've been
dying to talk about this to someone, I'm glad you asked :-)






On Mon, Oct 8, 2018 at 3:17 PM Ben Coman <btc at openinworld.com> wrote:

>
> On Mon, 8 Oct 2018 at 02:08, <commits at source.squeak.org> wrote:
>
>> ClementBera uploaded a new version of VMMaker to project VM Maker:
>>
>>   SpurSelectiveCompactor compacts memory by selecting the memory segments
>> with the most free space and compacting only those, to limit fragmentation
>> while being really quick to perform. The algorithm is fast mostly because
>> it does not update pointers: they are updated lazily during the next
>> marking phase, so there is no need to read the fields of objects in memory
>> segments other than the ones compacted.
>>
>>   The algorithm works as follows. First, a global sweep pass iterates over
>> the memory linearly, changing unmarked objects to free chunks and
>> concatenating free chunks. During the global sweep phase, the segments of
>> the heap are analysed to determine their percentage of occupation. Second,
>> the least occupied segments are compacted by copying the remaining live
>> objects into an entirely free segment, called segmentToFill (we detail
>> later in the paragraph where segmentToFill comes from), turning the
>> original objects into forwarding objects and marking the free chunks as
>> unavailable (removed from the free list and marked as data objects).
>> Third, the next marking phase removes all forwarders. Fourth, at the
>> beginning of the next compaction phase the compacted segments from the
>> previous GC can be entirely marked as free space (no need to check
>> anything inside, there were only forwarders and trash data). One of the
>> compacted segments is then selected as the segmentToFill, the others are
>> just marked as free chunks.
>>
>>
>>   The compaction is effectively partial, compacting only the most
>> critical segments of the heap to limit fragmentation. Compaction time is
>> crazy low, since few objects are moved and the pointer updates are done
>> lazily during the next marking phase, while still preventing memory
>> fragmentation.
>>
>>   Now this works well when biasForGC is true, but when performing a
>> snapshot, the compactor is just total crap (we need to figure out a
>> solution).
>>
>>   Performance trick for lilliputian chunks:
>> + Specific free chunks (called lilliputian, see isLilliputianSize:) are
>> managed using a single linked list instead of a double linked list since
>> there's not enough room in the free chunk for the back pointer. During the
>> sweep phase this is not a problem since we're rebuilding the free chunk
>> structure, but during selective compaction we're detaching free chunks from
>> the free chunk structure and that can be horribly slow (10 seconds
>> sometimes at a 20Gb heap due to many iterations over the single linked
>> list). To work around this problem, the sweep phase uses the
>> lastLilliputianChunk variable to sort the lilliputian free chunk single
>> linked list in ascending address order (See
>> interceptAddFreeChunkWithBytes:at:). During the selective compaction
>> phase, the same variable is re-used to iterate at most once over the
>> single linked list while detaching lilliputian chunks (See
>> incrementalUnlinkSmallChunk:). In addition, each segment is annotated
>> during the sweep phase with the last lilliputian chunk it holds. Hence,
>> during the compaction phase, the linked list is iterated but the iteration
>> can jump to the last chunk of the previous segment to compact.
>>
>
>
>> - Specific free chunks (called lilliputian, see isLilliputianSize:) are
>> managed using a single linked list instead of a double linked list since
>> there's not enough room in the free chunk for the back pointer. During the
>> sweep phase this is not a problem since we're rebuilding the free chunk
>> structure, but during selective compaction we're detaching free chunks from
>> the free chunk structure and that can be horribly slow (10 seconds
>> sometimes at a 20Gb heap due to many iterations over the single linked
>> list). To work around this problem, the sweep phase uses the
>> lastLilliputianChunk variable to sort the lilliputian free chunk single
>> linked list in ascending address order (See
>> interceptAddFreeChunkWithBytes:at:). During the selective compaction
>> phase, the same variable is re-used to iterate at most once over the
>> single linked list while detaching lilliputian chunks (See
>> incrementalUnlinkSmallChunk:).
>>
>
> Please excuse me, I only skimmed the implementation and so don't have
> a good grasp of whether what I'm saying is practical,
> but anyway, reading the above description, a random idea floating along
> the wind found me...
>
> Have you considered per-segment-Lilliputian-list?
> Reserve the first word of each segment as the pointer to the segment's
> first free-Lilliputian-chunk.  If it is null there are no
> free-Lilliputian-chunks in that segment.
> The global sweep easily(?) knows which segment a Lilliputian-free-chunk
> belongs to and prepends to that per-segment list.
>
> To combine all those separate per-segment lists into an overall-list...
> split that reserved word into two parts...
> * the LSB part provides an "offset" to the first element of the
> per-segment-list
> * the MSB part provides a pointer to the next segment with a
> per-segment-list i.e. "next-per-segment-list".
> Also btw, Lilliputian-free-chunks only require space to hold the
> "next-offset" within its segment
>
> To allocate from the overall-list...
>   given a root pointer "X", follow to the first segment,
>   allocate chunk at "offset",
>   update "offset" with "next-offset",
>   if "offset" is zero, update "X" with "next-per-segment-list"
>
> Reserve a second word at the start of each segment and you have a back
> pointer to the previous segment.
> So when a segment is removed, it's easy to remove all its
> Lilliputian-free-chunks from the overall-list in one go.
> The Lilliputian-free-chunk-overall-list would be like a double-linked-list
> with offshoots.
>
> Maybe(?) then the Lilliputian list never needs to be sorted,
> and also doesn't need to be traversed when freeing a segment, just remove
> the segment from the overall-list.
>
> cheers -ben
>
>
> P.S. Please feel free to play "idea"-whac-a-mole.   I've not put a lot of
> critical thinking into it.   It was easier to share than stop it bubbling
> about.
> https://en.wikipedia.org/wiki/Whac-A-Mole
>