[Vm-dev] rough idea for per segment Lilliputian lists

Clément Béra bera.clement at gmail.com
Tue Oct 9 17:26:08 UTC 2018


We're talking about compaction pause, right?

Current full GC pause with Selective is mark pause + scavenge pause + sweep
pause + compaction pause + shrink pause.

For a 20Gb Moose workload, that's 17 sec + 25ms (avg, up to 68ms) + 1.5 sec +
~20ms + 0 ms (up to 4ms).

The thing is that, research-wise, the mark and sweep pauses can be made
incremental or concurrent. So only scavenge and compaction are problematic
(hardly made concurrent without redesigning the full memory manager and
venturing into uncharted territory). That means that in theory, with this
design, we can have a max 20ms pause.

On Tue, Oct 9, 2018 at 6:26 PM Ben Coman <btc at openinworld.com> wrote:

>
> On Tue, 9 Oct 2018 at 15:52, Clément Béra <bera.clement at gmail.com> wrote:
>
>> Solution:
>>
>> As Ben suggested, one solution would be to organize the lilliputian
>> singly linked list as per-segment linked lists. I did not do that because
>> it would heavily impact the free chunk organization; I changed it in the
>> past, and it took 2 days to make sure I did not break anything in the
>> production VM.
>>
>> So what I did first instead was to try to sort the lilliputians while
>> sweeping (which is almost free
>>
>
> cool, that's good context. It's fair to balance simplicity (i.e.
> reliability) versus optimization.
>
>
> I just need an extra variable and execution time is not really impacted),
>> then iterate over the lilliputian linked list while compacting segments;
>> and since segments are in ascending address order, that meant I could
>> iterate at most once over the lilliputian linked list while compacting
>> (O(N^2) becomes O(N)). That reduced the stupid pathological cases from
>> multiple seconds to several hundred ms.
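The single-pass trick described above can be sketched in C (a minimal sketch with hypothetical names, not the actual VMMaker code): because both the segments being compacted and the sorted lilliputian list are in ascending address order, one shared cursor into the list never has to rewind, so the whole compaction pass touches each list node at most once.

```c
#include <assert.h>
#include <stddef.h>

/* A lilliputian free chunk: just an address and a 'next' link
   (no room for a back pointer). Names are hypothetical. */
typedef struct Chunk { size_t addr; struct Chunk *next; } Chunk;

/* Advance the shared cursor to the first chunk at or past segStart.
   Because callers pass segment starts in ascending order, the cursor
   only ever moves forward: O(N) over the whole pass. */
static Chunk *firstChunkAtOrAfter(Chunk **cursor, size_t segStart) {
    while (*cursor != NULL && (*cursor)->addr < segStart)
        *cursor = (*cursor)->next;
    return *cursor;
}
```

Restarting the search from the head for every segment instead is what produces the O(N^2) pathological cases.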
>>
>> That was great, but still not enough (we target 16ms pauses for 60fps
>> apps). So in addition I annotated each segment while sweeping with its
>> last lilliputian, which means that now I can iterate over the lilliputian
>> linked list per segment and remove the remaining overhead, leading to
>> compaction pauses of several ms.
>>
>
> I missed that pauses were that small already.
>
>
> I still clear the lilliputians while iterating over the segment in the
>> compaction phase. Clearing all the lilliputians of one segment at once is
>> indeed now possible, but it has no real performance benefit IMO since it
>> does not avoid any cache miss (the previous lilliputian is in cache from
>> the segment iteration and the current one has just been read from memory).
>> The current impl works, so I've just kept it that way.
>>
>> Now what you suggest, Ben, is better: no sorting of lilliputians while
>> sweeping, just keep a reference to the last lilliputian in the segment
>> (the first one found in the segment, since they end up in reverse address
>> order), and then clear all of the segment's lilliputians at once when
>> compacting it. It was just easier to make it work the way I did, given
>> what was implemented at the time.
>>
>
> cool. Nice to know my understanding was on the right track, and how other
> practical matters affect it.
>
>
> I thought about doing something similar to what you described when I
>> finished, but I don't think it would impact performance that much (it does
>> not remove reads/writes with cache misses), so I was too lazy.
>>
>> Right now I have 2 development images, one matching the trunk and one
>> with extra profiling changes for my paper, so it is difficult to change
>> things. But I may do that later when I finish this paper or you can change
>> it yourself.
>>
>
> I'm sure I'd find it interesting to do so, but I'm already behind on some
> projects.
>
>
> Compiling with Selective requires changing compactorClass and enabling
>> tempvectorreadbarrier in the Slang compilation settings. The only thing is
>> that you cannot really snapshot right now with Selective (the idea is to
>> have both Selective and Planning, so that snapshots use Planning and the
>> rest of the code uses Selective, see hybridcompactor; not sure that works
>> so far though, because without full heap compaction snapshots are quite a
>> bit bigger).
>>
>> Damned, that was a long mail. But you really nailed the problem. I've been
>> dying to talk about this to someone, so I'm glad you asked :-)
>>
>
> I'm glad I can prompt the creative process.
>
> cheers -ben
>
>
>> On Mon, Oct 8, 2018 at 3:17 PM Ben Coman <btc at openinworld.com> wrote:
>>
>>>
>>> On Mon, 8 Oct 2018 at 02:08, <commits at source.squeak.org> wrote:
>>>
>>>> ClementBera uploaded a new version of VMMaker to project VM Maker:
>>>>
>>>>   SpurSelectiveCompactor compacts memory by selecting the memory
>>>> segments with the most free space and compacting only those, to limit
>>>> fragmentation while being really quick to perform. The algorithm is fast
>>>> mostly because it does not update pointers: they are updated lazily during
>>>> the next marking phase, so there is no need to read the fields of objects
>>>> in memory segments other than the one compacted.
>>>>
>>>>   The algorithm works as follows. First, a global sweep pass iterates
>>>> over the memory linearly, changing unmarked objects to free chunks and
>>>> concatenating free chunks. During the global sweep phase, the segments of
>>>> the heap are analysed to determine their percentage of occupation. Second,
>>>> the least occupied segments are compacted by copying the remaining live
>>>> objects into an entirely free segment, called the segmentToFill (we detail
>>>> later in the paragraph where the segmentToFill comes from), changing the
>>>> copied objects to forwarding objects and marking the free chunks as
>>>> unavailable (removed from the free list and marked as data objects).
>>>> Third, the next marking phase removes all forwarders. Fourth, at the
>>>> beginning of the next compaction phase, the segments compacted during the
>>>> previous GC can be entirely marked as free space (no need to check
>>>> anything inside: there were only forwarders and trash data). One of the
>>>> compacted segments is then selected as the new segmentToFill; the others
>>>> are just marked as free chunks.
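The segment-selection step of the algorithm described above can be sketched roughly as follows (a C sketch with hypothetical names and threshold; the actual VMMaker heuristic may differ):

```c
#include <assert.h>

/* Per-segment occupancy statistics recorded during the sweep phase.
   Names and layout are hypothetical, for illustration only. */
typedef struct { unsigned liveBytes, totalBytes; int toCompact; } SegInfo;

/* Mark segments whose occupancy is at or below maxOccupancy as
   compaction candidates; return how many were picked. */
static int selectSegmentsToCompact(SegInfo *segs, int n, double maxOccupancy) {
    int picked = 0;
    for (int i = 0; i < n; i++) {
        double occ = (double)segs[i].liveBytes / (double)segs[i].totalBytes;
        segs[i].toCompact = (occ <= maxOccupancy);
        picked += segs[i].toCompact;
    }
    return picked;
}
```

Only the picked segments are then evacuated into the segmentToFill; mostly-full segments are left untouched, which is what keeps the pause so short.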
>>>>
>>>>
>>>>   The compaction is effectively partial, compacting only the most
>>>> critical segments of the heap to limit fragmentation. Compaction time is
>>>> crazy low, since few objects are moved and pointer updating is done
>>>> lazily during the next marking phase, while still preventing memory
>>>> fragmentation.
>>>>
>>>>   Now this works well when biasForGC is true, but when performing a
>>>> snapshot, the compactor is just total crap (we need to figure out a
>>>> solution).
>>>>
>>>>   Performance trick for lilliputian chunks:
>>>> + Specific free chunks (called lilliputian, see isLilliputianSize:) are
>>>> managed using a singly linked list instead of a doubly linked list, since
>>>> there's not enough room in the free chunk for the back pointer. During
>>>> the sweep phase this is not a problem, since we're rebuilding the free
>>>> chunk structure, but during selective compaction we're detaching free
>>>> chunks from the free chunk structure, and that can be horribly slow (10
>>>> seconds sometimes at a 20Gb heap, due to many iterations over the singly
>>>> linked list). To work around this problem, the sweep phase uses the
>>>> lastLilliputianChunk variable to sort the lilliputian free chunk singly
>>>> linked list in ascending address order (see
>>>> interceptAddFreeChunkWithBytes:at:). During the selective compaction
>>>> phase, the same variable is re-used to iterate at most once over the
>>>> singly linked list while detaching lilliputian chunks (see
>>>> incrementalUnlinkSmallChunk:). In addition, each segment is annotated
>>>> during the sweep phase with the last lilliputian chunk it holds. Hence,
>>>> during the compaction phase, the linked list is iterated, but the
>>>> iteration can jump to the last chunk of the previous segment to compact.
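Why lilliputian chunks exist at all can be illustrated with made-up sizes (not Spur's actual header layout): a doubly linked free chunk must hold a header plus both a next and a back pointer, so any chunk smaller than that can only be singly linked.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative word size, NOT Spur's actual object format. */
enum { WORD_SIZE = 8 };

/* A doubly linked free chunk needs header + next + prev = 3 words;
   anything smaller ("lilliputian") has no room for the back pointer
   and must live on a singly linked list. */
static int isLilliputianSize(size_t chunkBytes) {
    return chunkBytes < 3 * WORD_SIZE;
}
```

It is exactly the missing back pointer that makes detaching these chunks expensive, hence the sorting and per-segment annotation tricks above.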
>>>>
>>>
>>>
>>>> - Specific free chunks (called lilliputian, see isLilliputianSize:) are
>>>> managed using a singly linked list instead of a doubly linked list, since
>>>> there's not enough room in the free chunk for the back pointer. During
>>>> the sweep phase this is not a problem, since we're rebuilding the free
>>>> chunk structure, but during selective compaction we're detaching free
>>>> chunks from the free chunk structure, and that can be horribly slow (10
>>>> seconds sometimes at a 20Gb heap, due to many iterations over the singly
>>>> linked list). To work around this problem, the sweep phase uses the
>>>> lastLilliputianChunk variable to sort the lilliputian free chunk singly
>>>> linked list in ascending address order (see
>>>> interceptAddFreeChunkWithBytes:at:). During the selective compaction
>>>> phase, the same variable is re-used to iterate at most once over the
>>>> singly linked list while detaching lilliputian chunks (see
>>>> incrementalUnlinkSmallChunk:).
>>>>
>>>
>>> Please excuse me that I only skimmed the implementation, and so don't
>>> have a good grasp of whether what I'm saying is practical,
>>> but anyway, reading the above description, a random idea floating along
>>> the wind found me...
>>>
>>> Have you considered per-segment-Lilliputian-list?
>>> Reserve the first word of each segment as the pointer to the segment's
>>> first free-Lilliputian-chunk.  If it is null there are no
>>> free-Lilliputian-chunks in that segment.
>>> The global sweep easily(?) knows which segment a Lilliputian-free-chunk
>>> belongs to and prepends it to that per-segment list.
>>>
>>> To combine all those separate per-segment lists into an overall-list...
>>> split that reserved word into two parts...
>>> * the LSB part provides an "offset" to the first element of the
>>> per-segment-list
>>> * the MSB part provides a pointer to the next segment with a
>>> per-segment-list i.e. "next-per-segment-list".
>>> Also btw, Lilliputian-free-chunks only require space to hold the
>>> "next-offset" within their segment.
>>>
>>> To allocate from the overall-list...
>>>   given a root pointer "X", follow to the first segment,
>>>   allocate chunk at "offset",
>>>   update "offset" with "next-offset",
>>>   if "offset" is zero, update "X" with "next-per-segment-list"
>>>
>>> Reserve a second word at the start of each segment and you have a back
>>> pointer to the previous segment.
>>> So when a segment is removed, it's easy to remove all its
>>> Lilliputian-free-chunks from the overall-list in one go.
>>> The Lilliputian-free-chunk-overall-list would be like a
>>> double-linked-list with offshoots.
>>>
>>> Maybe(?) then the Lilliputian never needs to be sorted,
>>> and also doesn't need to be traversed when freeing a segment, just
>>> remove the segment from the overall-list.
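The packed reserved word Ben proposes could be encoded roughly like this (a C sketch; all layout choices, bit widths, and names are hypothetical): the low bits hold the offset of the segment's first lilliputian chunk (0 meaning the segment has none), and the high bits link to the next segment that has a non-empty per-segment list.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical split of the reserved per-segment word:
   low 16 bits  = offset of first lilliputian chunk (0 = none),
   high 48 bits = link to next segment with a per-segment list. */
enum { OFFSET_BITS = 16 };
#define OFFSET_MASK ((UINT64_C(1) << OFFSET_BITS) - 1)

static uint64_t packSegmentWord(uint64_t nextSegment, uint64_t firstOffset) {
    return (nextSegment << OFFSET_BITS) | (firstOffset & OFFSET_MASK);
}
static uint64_t firstLilliputianOffset(uint64_t word) {
    return word & OFFSET_MASK;
}
static uint64_t nextSegmentWithList(uint64_t word) {
    return word >> OFFSET_BITS;
}
```

With this layout, freeing a whole segment's lilliputians is a matter of unlinking one node from the segment-level list, never walking the chunk-level list.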
>>>
>>> cheers -ben
>>>
>>>
>>> P.S. Please feel free to play "idea"-whac-a-mole. I've not put a lot
>>> of critical thinking into it. It was easier to share it than to stop it
>>> bubbling about.
>>> https://en.wikipedia.org/wiki/Whac-A-Mole
>>>
>>
>>
>>

-- 
Clément Béra
https://clementbera.github.io/
https://clementbera.wordpress.com/

