[Vm-dev] rough idea for per segment Lilliputian lists

Ben Coman btc at openinworld.com
Tue Oct 9 16:26:53 UTC 2018


On Tue, 9 Oct 2018 at 15:52, Clément Béra <bera.clement at gmail.com> wrote:

> Solution:
>
> As Ben suggested, one solution would be to organize the lilliputian
> singly linked list as per-segment linked lists. I did not do that because
> it would impact the free chunk organization a lot; I changed that in the
> past and it takes 2 days to make sure I did not break anything in the
> production VM.
>
> So what I did instead, first, was to try to sort the lilliputians while
> sweeping (which is almost free:
>

cool, that's good context. It's fair to balance simplicity (i.e. reliability)
versus optimization.


> I just need an extra variable and execution time is not really impacted),
> iterate over the lilliputian linked list while compacting segments, and
> since segments are in ascending address order, that meant I could iterate
> at most once over the lilliputian linked list while compacting (O(N^2)
> becomes O(N)). That reduced the stupid pathological cases from multiple
> seconds to a few hundred ms.
>
> That was great, but still not enough (we target 16ms pauses for 60fps
> apps). So in addition I annotated each segment while sweeping with its
> last lilliputian, which means that now I can iterate over the lilliputian
> linked list per segment and remove the remaining overhead, leading to
> compaction pauses of several ms.
>

I missed that pauses were that small already.
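
To check my own understanding, here's the trick as I picture it, as a rough
C sketch (all names invented; the real code is Slang, see the
interceptAddFreeChunkWithBytes:at: and incrementalUnlinkSmallChunk: methods
mentioned in the commit notice quoted further down):

  #include <stddef.h>

  typedef struct Chunk {            /* a lilliputian free chunk: too small */
      struct Chunk *next;           /* for anything but the next pointer   */
  } Chunk;

  typedef struct Segment {
      char  *start, *limit;
      Chunk *lastLilliputian;       /* annotation written by the sweep */
  } Segment;

  static Chunk *lilliputianHead;    /* kept in ascending address order by the sweep */
  static Chunk *cursor;             /* last known predecessor; survives across unlinks */

  /* Before compacting a segment, jump the cursor straight to the last
   * lilliputian of the nearest preceding segment that keeps its chunks,
   * skipping every chunk in segments that are not being compacted. */
  static void startSegment(Segment *prevSeg)
  {
      if (prevSeg != NULL && prevSeg->lastLilliputian != NULL)
          cursor = prevSeg->lastLilliputian;
  }

  /* Unlink one chunk.  Compaction frees chunks in ascending address order
   * too, so the cursor only ever moves forward: all the unlinks of one
   * compaction together walk the list at most once, O(N) not O(N^2). */
  static void incrementalUnlink(Chunk *chunk)
  {
      Chunk *prev = cursor;
      Chunk *cur  = (prev != NULL) ? prev->next : lilliputianHead;

      while (cur != chunk) {        /* walks forward, never restarts at the head */
          prev = cur;
          cur  = cur->next;
      }
      if (prev != NULL)
          prev->next = chunk->next;
      else
          lilliputianHead = chunk->next;
      cursor = prev;
  }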


> I still clear the lilliputians while iterating over the segment in the
> compaction phase; clearing all the lilliputians of 1 segment at once is
> indeed now possible, but it has no real performance benefit IMO since it
> does not avoid any cache miss (the prev lilliputian is in cache from the
> segment iteration and the current one has just been read from memory). The
> current impl works so I've just kept it that way.
>
> Now what you suggest, Ben, is better: no sorting of lilliputians while
> sweeping, just keep a reference to the last lilliputian in the segment
> (the first one found in the segment, since they end up in reverse address
> order), and then clear all the segment's lilliputians at once when
> compacting it. It was just easier to make it work the way I did, given
> what was already implemented at the time.
>

Cool. Nice to know my understanding was on the right track, and to see how
other practical matters affect it.
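
For the record, what I was picturing makes the whole-segment removal a
single splice. Continuing the toy C from above, with an extra firstFound
field on Segment (again, all invented): since the sweep would just prepend
chunks as it finds them, the list runs in descending address order and each
segment's chunks form one contiguous run, with firstFound (the first chunk
the sweep met in the segment) as the deepest node of that run.

  /* pred is the node just before seg's run on the list (the firstFound of
   * the nearest higher-addressed segment that has lilliputians), or NULL
   * when seg's run starts at the head. */
  static void spliceOutSegment(Segment *seg, Chunk *pred)
  {
      Chunk *afterRun = seg->firstFound->next;  /* first chunk of lower segments */

      if (pred != NULL)
          pred->next = afterRun;
      else
          lilliputianHead = afterRun;
      seg->firstFound = NULL;       /* one O(1) splice empties the segment */
  }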


> I thought about doing something similar to what you described when I
> finished, but I don't think it would impact performance that much (it does
> not remove reads/writes with cache misses), so I was too lazy.
>
> Right now I have 2 development images, one matching the trunk and one with
> extra profiling changes for my paper, so it is difficult to change things.
> But I may do that later when I finish this paper, or you can change it
> yourself.
>

I'm sure I'd find it interesting to do so, but I'm already behind on some
projects.


> Compiling with selective requires changing compactorClass and enabling
> tempvectorreadbarrier in the Slang compilation settings. The only thing is
> that you cannot really snapshot right now with selective (the idea is to
> have both selective and planning so that snapshots use planning and the
> rest of the code uses selective, see hybridcompactor; not sure that works
> so far though, because without full heap compaction snapshots are quite a
> bit bigger).
>
> Damn, that was a long mail. But you really nailed the problem. I've been
> dying to talk about that with someone. I'm glad you asked :-)
>

I'm glad I can prompt the creative process.

cheers -ben


> On Mon, Oct 8, 2018 at 3:17 PM Ben Coman <btc at openinworld.com> wrote:
>
>>
>> On Mon, 8 Oct 2018 at 02:08, <commits at source.squeak.org> wrote:
>>
>>> ClementBera uploaded a new version of VMMaker to project VM Maker:
>>>
>>>   SpurSelectiveCompactor compacts memory by selecting the memory
>>> segments with the most free space and compacting only those, to limit
>>> fragmentation while being really quick to perform. The algorithm is fast
>>> mostly because it does not update pointers: they are updated lazily during
>>> the next marking phase, so there is no need to read the fields of objects
>>> in memory segments other than the ones compacted.
>>>
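As I read that, the lazy fix-up amounts to something like this toy C model
(invented and oversimplified): compaction leaves a forwarder behind each
moved object and repairs nothing; the *next* marking phase fixes stale
pointers as a side effect of scanning each live object's fields.

  #include <stddef.h>

  typedef struct Obj {
      struct Obj *fwd;    /* forwarding pointer left by the compactor, or NULL */
      int         marked;
  } Obj;

  static void markField(Obj **slot)
  {
      Obj *o = *slot;
      while (o->fwd != NULL)        /* follow (possibly chained) forwarders */
          o = o->fwd;
      *slot = o;                    /* repair the stale pointer in place... */
      o->marked = 1;                /* ...then continue marking as usual */
  }
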
>>>   The algorithm works as follows. First, a global sweep pass iterates
>>> over the memory linearly, changing unmarked objects to free chunks and
>>> concatenating free chunks. During the global sweep phase, the segments of
>>> the heap are analysed to determine their percentage of occupation. Second,
>>> the least occupied segments are compacted by copying the remaining live
>>> objects into an entirely free segment, called segmentToFill (we detail
>>> later in the paragraph where segmentToFill comes from), turning the moved
>>> objects into forwarding objects and marking the free chunks as unavailable
>>> (removed from the free list and marked as data objects). Third, the next
>>> marking phase removes all forwarders. Fourth, at the beginning of the next
>>> compaction phase the compacted segments from the previous GC can be
>>> entirely marked as free space (no need to check anything inside: there
>>> were only forwarders and trash data). One of the compacted segments is
>>> then selected as the segmentToFill; the others are just marked as free
>>> chunks.
>>>
>>>
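A toy C rendition of the selection-and-copy step, as I understand it
(invented and oversimplified: fixed-size objects in slot arrays stand in
for the real variable-sized objects, and the free-chunk bookkeeping is
elided):

  #include <string.h>

  #define SLOTS 1024

  typedef struct TObj { struct TObj *fwd; int live; char data[24]; } TObj;

  typedef struct Seg {
      TObj objs[SLOTS];
      int  used;                    /* live slots, as counted by the sweep */
      int  compacted;               /* freed wholesale at the next cycle */
  } Seg;

  /* Copy the live objects of a sparsely occupied segment into the spare
   * segment (the segmentToFill), leaving forwarders behind.  Pointers
   * elsewhere in the heap stay stale until the next marking phase. */
  static void evacuate(Seg *from, Seg *to)
  {
      for (int i = 0; i < SLOTS; i++) {
          TObj *o = &from->objs[i];
          if (!o->live)
              continue;             /* the real code also unlinks free chunks here */
          TObj *copy = &to->objs[to->used++];
          memcpy(copy, o, sizeof *copy);
          o->fwd  = copy;           /* the original becomes a forwarder */
          o->live = 0;
      }
      from->compacted = 1;          /* next GC reclaims it as one big free chunk */
  }
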
>>>   The compaction is effectively partial, compacting only the most
>>> critical segments of the heap to limit fragmentation. Compaction time is
>>> crazy low, since few objects are moved and pointer updating is done
>>> lazily during the next marking phase, while still preventing memory
>>> fragmentation.
>>>
>>>   Now this works well when biasForGC is true, but when performing a
>>> snapshot, the compactor is just total crap (we need to figure out a
>>> solution).
>>>
>>>   Performance trick for lilliputian chunks:
>>> + Specific free chunks (called lilliputian, see isLilliputianSize:) are
>>> managed using a singly linked list instead of a doubly linked list since
>>> there's not enough room in the free chunk for the back pointer. During the
>>> sweep phase this is not a problem since we're rebuilding the free chunk
>>> structure, but during selective compaction we're detaching free chunks
>>> from the free chunk structure and that can be horribly slow (10 seconds
>>> sometimes at a 20Gb heap due to many iterations over the singly linked
>>> list). To work around this problem, the sweep phase uses the
>>> lastLilliputianChunk variable to sort the lilliputian free chunk singly
>>> linked list in ascending address order (see
>>> interceptAddFreeChunkWithBytes:at:). During the selective compaction
>>> phase, the same variable is re-used to iterate at most once over the
>>> singly linked list while detaching lilliputian chunks (see
>>> incrementalUnlinkSmallChunk:). In addition, each segment is annotated
>>> during the sweep phase with the last lilliputian chunk it holds. Hence,
>>> during the compaction phase, the linked list is iterated but the
>>> iteration can jump to the last chunk of the previous segment to compact.
>>>
>>
>> Please excuse me: I only skimmed the implementation, so I don't have
>> a good grasp of whether what I'm saying is practical,
>> but anyway, reading the above description, a random idea floating along
>> the wind found me...
>>
>> Have you considered a per-segment Lilliputian list?
>> Reserve the first word of each segment as the pointer to the segment's
>> first free-Lilliputian-chunk.  If it is null there are no
>> free-Lilliputian-chunks in that segment.
>> The global sweep easily(?) knows which segment a Lilliputian-free-chunk
>> belongs to and prepends it to that per-segment list.
>>
>> To combine all those separate per-segment lists into an overall-list...
>> split that reserved word into two parts...
>> * the LSB part provides an "offset" to the first element of the
>> per-segment-list
>> * the MSB part provides a pointer to the next segment with a
>> per-segment-list, i.e. "next-per-segment-list".
>> Also btw, Lilliputian-free-chunks only require space to hold the
>> "next-offset" within their segment.
>>
>> To allocate from the overall-list...
>>   given a root pointer "X", follow it to the first segment,
>>   allocate the chunk at "offset",
>>   update "offset" with "next-offset",
>>   if "offset" is zero, update "X" with "next-per-segment-list"
>>
>> Reserve a second word at the start of each segment and you have a back
>> pointer to the previous segment.
>> So when a segment is removed, it's easy to take all its
>> Lilliputian-free-chunks out of the overall-list in one go.
>> The Lilliputian-free-chunk overall-list would be like a
>> doubly-linked-list with offshoots.
>>
>> Maybe(?) then the Lilliputian list never needs to be sorted,
>> and also doesn't need to be traversed when freeing a segment; just remove
>> the segment from the overall-list.
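
Sketching that in C now, to make the shape concrete (rough and entirely
invented: the 32/32 split of the reserved word, the names, and the segment
table are all assumptions, and the sweep that rebuilds these lists is not
shown). Offset 0 can safely mean "none" because the reserved header words
themselves occupy the first bytes of every segment:

  #include <stdint.h>
  #include <stddef.h>

  #define NONE  UINT32_MAX
  #define NSEGS 256

  typedef struct SegHead {
      uint32_t firstOffset;  /* LSB half: offset of first free Lilliputian, 0 = none */
      uint32_t nextSeg;      /* MSB half: next segment with Lilliputians, NONE = end */
      uint32_t prevSeg;      /* the second reserved word: back link, O(1) removal */
  } SegHead;

  static SegHead  seg[NSEGS];
  static char    *segBase[NSEGS];   /* payload start of each segment */
  static uint32_t rootSeg = NONE;   /* "X": head of the overall-list */

  /* A Lilliputian chunk stores nothing but its next-offset within the segment. */
  static uint32_t chunkNext(uint32_t s, uint32_t off)
  {
      return *(uint32_t *)(segBase[s] + off);
  }

  /* Allocation, following the steps above: follow "X" to the first segment,
   * take the chunk at "offset", replace "offset" with "next-offset", and
   * when the segment runs dry advance "X" to "next-per-segment-list". */
  static void *allocLilliputian(void)
  {
      uint32_t s = rootSeg;
      if (s == NONE)
          return NULL;                        /* no Lilliputians anywhere */
      uint32_t off = seg[s].firstOffset;
      seg[s].firstOffset = chunkNext(s, off);
      if (seg[s].firstOffset == 0) {          /* segment ran dry */
          rootSeg = seg[s].nextSeg;
          if (rootSeg != NONE)
              seg[rootSeg].prevSeg = NONE;
      }
      return segBase[s] + off;
  }

  /* Freeing a whole segment: one O(1) unlink from the overall-list; its
   * chunks are never sorted and never traversed. */
  static void removeSegment(uint32_t s)
  {
      uint32_t p = seg[s].prevSeg, n = seg[s].nextSeg;
      if (p != NONE)
          seg[p].nextSeg = n;
      else if (rootSeg == s)
          rootSeg = n;
      if (n != NONE)
          seg[n].prevSeg = p;
      seg[s].firstOffset = 0;
      seg[s].nextSeg = seg[s].prevSeg = NONE;
  }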
>>
>> cheers -ben
>>
>>
>> P.S. Please feel free to play "idea"-whac-a-mole.   I've not put a lot of
>> critical thinking into it.   It was easier to share it than to stop it
>> bubbling about.
>> https://en.wikipedia.org/wiki/Whac-A-Mole
>>