[Vm-dev] Re: [Pharo-project] Plan/discussion/communication around new object format

Wed May 30 21:14:21 UTC 2012

On Wed, May 30, 2012 at 1:53 PM, Mariano Martinez Peck <
marianopeck at gmail.com> wrote:

>
>
>
> On Wed, May 30, 2012 at 10:22 PM, Eliot Miranda <eliot.miranda at gmail.com>wrote:
>
>>
>>
>>
>> On Wed, May 30, 2012 at 12:59 PM, Stéphane Ducasse <
>> stephane.ducasse at inria.fr> wrote:
>>
>>> I would like to be sure that we can have
>>>        - bit for immutable objects
>>>        - bits for experimenting.
>>>
>>
>> There will be quite a few.  And one will be able to steal bits from the
>> class field if one needs fewer classes.  I'm not absolutely sure of the
>> layout yet.  But for example
>>
>> 8: slot size (255 => extra header word with large size)
>> 3: odd bytes/fixed fields (odd bytes for non-pointer, fixed fields for
>> pointer, 7 => # fixed fields is in the class)
>> 4 bits: format (pointers, indexable, bytes/shorts/longs/doubles
>> indexable, compiled method, ephemerons, weak, etc)
>> 1: immutability
>> 3: GC 2 mark bits. 1 forwarded bit
>> 20: identity hash
>>
>
> and we can make it lazy, that is, we compute it not at instantiation time
> but rather the first time the primitive is call.
>
>
>
>> 20: class index
>>
>
> This would probably work for a while. I think that it would be good to let
> an "open door" so that in the future we can just add one more word for a
> class pointer.
>

Turns out that's not such a simple change.  Class indices have two
advantages.  One is that they're more compact (2^20 classes is still a lot
of classes).  The other is that they're constant, which has two main
benefits.  First, in method caches and in-line caches the class field holds
an index and hence doesn't need to be updated by the GC.  The GC no longer
has top visit send sites.  Second, because they're constants both checking
for well-known classes and instantiating well-known classes can be done
without going to the specialObjectsArray. One just uses the constant.  Now
undoing these optimizations to open a back-dorr is not trivial.  So best
accept the benefits and exploit them to a maximum.

>
>>
>> still leaves 5 bits unused.  And stealing 4 bits each from class index
>> still leaves 64k classes.  So this format is simple and provides lots of
>> unused bits.  The format field is a great idea as it combines a number of
>> orthogonal properties in very few bits.  I don't want to include odd bytes
>> in format because I think a separate field that holds odd bytes and fixed
>> fields is better use of space.  But we can gather statistics before
>> deciding.
>>
>>
>>> Stef
>>>
>>> On May 30, 2012, at 8:48 AM, Stéphane Ducasse wrote:
>>>
>>> > Hi guys
>>> >
>>> > Here is an important topic I would like to see discussed so that we
>>> see how we can improve and join forces.
>>> > May a mail discussion then a wiki for the summary would be good.
>>> >
>>> >
>>> > stef
>>> >
>>> >
>>> >
>>> > Begin forwarded message:
>>> >
>>> >> From: Eliot Miranda <eliot.miranda at gmail.com>
>>> >> Subject: Re: Plan/discussion/communication around new object format
>>> >> Date: May 27, 2012 10:49:54 PM GMT+02:00
>>> >> To: Stéphane Ducasse <stephane.ducasse at inria.fr>
>>> >>
>>> >>
>>> >>
>>> >> On Sat, May 26, 2012 at 1:46 AM, Stéphane Ducasse <
>>> stephane.ducasse at inria.fr> wrote:
>>> >> Hi eliot
>>> >>
>>> >> do you have a description of the new object format you want to
>>> introduce?
>>> >>
>>> >> The design is in the class comment of CogMemoryManager in the Cog
>>> VMMaker packages.
>>> >>
>>> >> Then what is your schedule?
>>> >>
>>> >> This is difficult. I have made a small start and should be able to
>>> spend time on it starting soon.  I want to have it finished by early next
>>> year.  But it depends on work schedules etc.
>>> >>
>>> >>
>>> >> I would like to see if we can allocate igor/esteban time before we
>>> run out of money
>>> >> to help on that important topic.
>>> >> Now the solution is unclear and I did not see any document where we
>>> can evaluate
>>> >> and plan how we can help. So do you want help on that topic? Then how
>>> can people
>>> >> contribute if any?
>>> >>
>>> >> The first thing to do is to read the design document, to see if the
>>> Pharo community thinks it is the right direction, and to review it, spot
>>> deficiencies etc.  So please get those interested to read the class comment
>>> of CogMemoryManager in the latest VMMaker.oscog.
>>> >>
>>> >> Here's the current version of it:
>>> >>
>>> >> CogMemoryManager is currently a place-holder for the design of the
>>> new Cog VM's object representation and garbage collector.  The goals for
>>> the GC are
>>> >>
>>> >> - efficient object representation a la Eliot Miranda's VisualWorks
>>> 64-bit object representation that uses a 64-bit header, eliminating direct
>>> class references so that all objects refer to their classes indirectly.
>>>  Instead the header contains a constant class index, in a field smaller
>>> than a full pointer, These class indices are used in inline and first-level
>>> method caches, hence they do not have to be updated on GC (although they do
>>> have to be traced to be able to GC classes).  Classes are held in a sparse
>>> weak table.  The class table needs only to be indexed by an instance's
>>> class index in class hierarchy search, in the class primitive, and in
>>> tracing live objects in the heap.  The additional header space is allocated
>>> to a much expanded identity hash field, reducing hash efficiency problems
>>> in identity collections due to the extremely small (11 bit) hash field in
>>> the old Squeak GC.  The identity hash field is also a key element of the
>>> class index scheme.  A class's identity hash is its index into the class
>>> table, so to create an instance of a class one merely copies its identity
>>> hash into the class index field of the new instance.  This implies that
>>> when classes gain their identity hash they are entered into the class table
>>> and their identity hash is that of a previously unused index in the table.
>>>  It also implies that there is a maximum number of classes in the table.
>>>  At least for a few years 64k classes should be enough.  A class is entered
>>> into the class table in the following operations:
>>> >>      behaviorHash
>>> >>      adoptInstance
>>> >>      instantiate
>>> >>      become  (i.e. if an old class becomes a new class)
>>> >>              if target class field's = to original's id hash
>>> >>                 and replacement's id hash is zero
>>> >>                      enter replacement in class table
>>> >> behaviorHash is a special version of identityHash that must be
>>> implemented in the image by any object that can function as a class (i.e.
>>> Behavior).
>>> >>
>>> >> - more immediate classes.  An immediate Character class would speed
>>> up String accessing, especially for WideString, since no instatiation needs
>>> to be done on at:put: and no dereference need be done on at:.  In a 32-bit
>>> system tag checking is complex since it is thought important to retain
>>> 31-bit SmallIntegers.  Hence, as in current Squeak, the least significant
>>> bit set implies a SmallInteger, but Characters would likely have a tag
>>> pattern of xxx10.  Hence masking with 11 results in two values for
>>> SmallInteger, xxx01 and xxx11.  30-bit characters are more than adequate
>>> for Unicode.  In a 64-bit system we can use the full three bits and
>>> usefully implement an immediate Float.  As in VisualWorks a functional
>>> representation takes three bits away from the exponent.  Rotating to put
>>> the sign bit in the least significant non-tag bit makes expanding and
>>> contracting the 8-bit exponent to the 11-bit IEEE double exponent easy ad
>>> makes comparing negative and positive zero easier (an immediate Float is
>>> zero if its unsigned 64-bits are < 16).  So the representation looks like
>>> >>      | 8 bit exponent | 52 bit mantissa | sign bit | 3 tag bits |
>>> >> For details see "60-bit immediate Floats" below.
>>> >>
>>> >>
>>> >> - efficient scavenging.  The current Squeak GC uses a slow
>>> pointer-reversal collector that writes every field in live objects three
>>> times in each collection, twice in the pointer-reversing heap traversal to
>>> mark live objects and once to update the pointer to its new location.  A
>>> scavenger writes every field of live data twice in each collection, once as
>>> it does a block copy of the object when copying to to space, once as it
>>> traverses the live pointers in the to space objects.  Of course the block
>>> copy is a relatively cheap write.
>>> >>
>>> >> - lazy become.  The JIT's use of inline cacheing provides a cheap way
>>> of avoiding scanning the heap as part of a become (which is the simple
>>> approach to implementing become in a system with direct pointers).  A
>>> becomeForward: on a (set of) non-zero-sized object(s) turns the object into
>>> a "corpse" or "forwarding object" whose first (non-header) word/slot is
>>> replaced by a pointer to the target of the becomeForward:.  The corpse's
>>> class index is set to one that identifies corpses and, because it is a
>>> hidden class index, will always fail an inline cache test.  The inline
>>> cache failure code is then responsible for following the forwarding pointer
>>> chain (these are Iliffe vectors :) ) and resolving to the actual target.
>>>  We have yet to determine exactly how this is done (e.g. change the
>>> receiver register and/or stack contents and retry the send, perhaps
>>> scanning the current activation).  See below on how we deal with becomes on
>>> objects with named inst vars.  Note that we probably don't have to worry
>>> about zero-sized objects.  These are unlikely to be passed through the FFI
>>> (there is nothing to pass :) ) and so will rarely be becommed.  If they do,
>>> they can become slowly.  Alternatively we can insist that objects are at
>>> least 16 bytes in size (see a8-byte alignment below) so that there will
>>> always be space for a forwarding pointer.  Since none of the immediate
>>> classes can have non-immediate instances and since we allocate the
>>> immediate classes indices corresponding to their tag pattern (SmallInteger
>>> = 1, Character = 3, SmallFloat = 4?) we can use all the class indices from
>>> 0 to 7 for special uses, 0 = forward, and e.g. 1 = header-sized filler.
>>> >>
>>> >> - pinning.  To support a robust and easy-to-use FFI the memory
>>> manager must support temporary pinning where individual objects can be
>>> prevented from being moved by the GC for as long as required, either by
>>> being one of an in-progress FFI call's arguments, or by having pinning
>>> asserted by a primitive (allowing objects to be passed to external code
>>> that retains a reference to the object after returning).  Pinning probably
>>> implies a per-object "is-pinned" bit in the object header.  Pinning will be
>>> done via lazy become; i..e an object in new space will be becommed into a
>>> pinned object in old space.  We will only support pinning in old space.
>>> >>
>>> >> - efficient old space collection.  An incremental collector (a la
>>> Dijkstra's three colour algorithm) collects old space, e.g. via an amount
>>> of tracing being hung off scavenges and/or old space allocations at an
>>> adaptive rate that keeps full garbage collections to a minimum.  (see free
>>> space/free list below)
>>> >>
>>> >> - 8-byte alignment.  It is advantageous for the FFI, for
>>> floating-point access, for object movement and for 32/64-bit
>>> compatibility to keep object sizes in units of 8 bytes.  For the FFI,
>>> 8-byte alignment means passing objects to code that expects that
>>> requirement (such as modern x86 numeric processing instructions).  This
>>> implies that
>>> >>      - the starts of all spaces are aligned on 8-byte boundaries
>>> >>      - object allocation rounds up the requested size to a multiple
>>> of 8 bytes
>>> >>      - the overflow size field is also 8 bytes
>>> >> We shall probably keep the minimum object size at 16 bytes so that
>>> there is always room for a forwarding pointer.  But this implies that we
>>> will need to implement an 8-byte filler to fill holes between objects > 16
>>> bytes whose length mod 16 bytes is 8 bytes and following pinned objects.
>>>  We can do this using a special class index, e.g. 1, so that the method
>>> that answers the size of an object looks like, e.g.
>>> >>      chunkSizeOf: oop
>>> >>              <var: #oop type: #'object *'>
>>> >>              ^object classIndex = 1
>>> >>                      ifTrue: [BaseHeaderSize]
>>> >>                      ifFalse: [BaseHeaderSize
>>> >>                                + (object slotSize = OverflowSlotSize
>>> >>                                              ifTrue:
>>> [OverflowSizeBytes]
>>> >>                                              ifFalse: [0])
>>> >>                                + (object slotSize * BytesPerSlot)]
>>> >>
>>> >>      chunkStartOf: oop
>>> >>              <var: #oop type: #'object *'>
>>> >>              ^(self cCoerceSimple: oop to: #'char *')
>>> >>                      - ((object classIndex = 1
>>> >>                          or: [object slotSize ~= OverflowSlotSize])
>>> >>                                      ifTrue: [0]
>>> >>                                      ifFalse: [OverflowSizeBytes])
>>> >>
>>> >> For the moment we do not tackle the issue of heap growth and
>>> shrinkage with the ability to allocate and deallocate heap segments via
>>> memory-mapping.  This technique allows space to be released back to the OS
>>> by unmapping empty segments.  We may revisit this but it is not a key
>>> requirement for the first implementation.
>>> >>
>>> >> The basic approach is to use a fixed size new space and a growable
>>> old space.  The new space is a classic three-space nursery a la Ungar's
>>> Generation Scavenging, a large eden for new objects and two smaller
>>> survivor spaces that exchange roles on each collection, one being the to
>>> space to which surviving objects are copied, the other being the from space
>>> of the survivors of the previous collection, i.e. the previous to space.
>>> >>
>>> >> To provide apparent pinning in new space we rely on lazy become.
>>>  Since most pinned objects will be byte data and these do not require stack
>>> zone activation scanning, the overhead is simply an old space allocation
>>> and corpsing.
>>> >>
>>> >> To provide pinning in old space, large objects are implicitly pinned
>>> (because it is expensive to move large objects and, because they are both
>>> large and relatively rare, they contribute little to overall fragmentation
>>> - as in aggregates, small objects can be used to fill-in the spaces between
>>> karge objects).  Hence, objects above a particular size are automatically
>>> allocated in old space, rather than new space.  Small objects are pinned as
>>> per objects in new space, by asserting the pin bit, which will be set
>>> automaticaly when allocating a large object.  As a last resort, or by
>>> programmer control (the fullGC primitive) old space is collected via
>>> mark-sweep (mark-compact) and so the mark phase must build the list of
>>> pinned objects around which the sweep/compact phase must carefully step.
>>> >>
>>> >> Free space in old space is organized by a free list/free tree as in
>>> Eliot's VisualWorks 5i old space allocator.  There are 64 free lists,
>>> indices 1 through 63 holding blocks of space of that size, index 0 holding
>>> a semi-balanced ordered tree of free blocks, each node being the head of
>>> the list of free blocks of that size.  At the start of the mark phase the
>>> free list is thrown away and the sweep phase coallesces free space and
>>> steps over pinned objects as it proceeds.  We can reuse the forwarding
>>> pointer compaction scheme used in the old collector.  Incremental
>>> collections merely move unmarked objects to the free lists (as well as
>>> nilling weak pointers in weak arrays and scheduling them for finalization).
>>>  The occupancy of the free lists is represented by a bitmap in a 64-bit
>>> integer so that an allocation of size 63 or less can know whether there
>>> exists a free chunk of that size, but more importantly can know whether a
>>> free chunk larger than it exists in the fixed size free lists without
>>> having to search all larger free list heads.
>>> >>
>>> >> The incremental collector (a la Dijkstra's three colour algorithm)
>>> collects old space via an amount of tracing being hung off scavenges and/or
>>> old space allocations at an adaptive rate that keeps full garbage
>>> collections to a minimum.  [N.B. Not sure how to do this yet.  The
>>> incremental collector needs to complete a pass often enough to reclaim
>>> objects, but infrequent enough not to waste time.  So some form of feedback
>>> should work.  In VisualWorks tracing is broken into quanta or work where
>>> image-level code determines the size of a quantum based on how fast the
>>> machine is, and how big the heap is.  This code could easily live in the
>>> VM, controllable through vmParameterAt:put:.  An alternative would be to
>>> use the heartbeat to bound quanta by time.  But in any case some amount of
>>> incremental collection would be done on old space allocation and
>>> scavenging, the ammount being chosen to keep pause times acceptably short,
>>> and at a rate to reclaim old space before a full GC is required, i.e. at a
>>> rate proportional to the growth in old space]. The incemental collector is
>>> a state machine, being either marking, nilling weak pointers, or freeing.
>>>  If nilling weak pointers is not done atomically then there must be a read
>>> barrier in weak array at: so that reading from an old space weak array that
>>> is holding stale un-nilled references to unmarked objects.  Tricks such as
>>> including the weak bit in bounds calculations can make this cheap for
>>> non-weak arrays.  Alternatively nilling weak pointers can be made an atomic
>>> part of incremental collection, which can be made cheaper by maintaining
>>> the set of weak arrays (e.g. on a list).
>>> >>
>>> >> The incremental collector implies a more complex write barrier.
>>>  Objects are of three colours, black, having been scanned, grey, being
>>> scanned, and white, unreached.  A mark stack holds the grey objects.   If
>>> the incremental collector is marking and an unmarked white object is stored
>>> into a black object then the stored object must become grey, being added to
>>> the mark stack.  So the wrte barrier is essentially
>>> >>      target isYoung ifFalse:
>>> >>              [newValue isYoung
>>> >>                      ifTrue: [target isInRememberedSet ifFalse:
>>> >>                                      [target addToRememberedSet]]
>>> "target now refers to a young object; it is a root for scavenges"
>>> >>                      ifFalse:
>>> >>                              [(target isBlack
>>> >>                                and: [igc marking
>>> >>                                and: [newValue isWhite]]) ifTrue:
>>> >>                                      [newValue beGrey]]] "add
>>> newValue to IGC's markStack for subsequent scanning"
>>> >>
>>> >> The incremental collector does not detect already marked objects all
>>> of whose references have been overwritten by other stores (e.g. in the
>>> above if newValue overwrites the sole remaining reference to a marked
>>> object).  So the incremental collector only guarantees to collect all
>>> garbage created in cycle N at the end of cycle N + 1.  The cost is hence
>>> slightly worse memory density but the benefit, provided the IGC works hard
>>> enough, is the elimination of long pauses due to full garbage collections,
>>> which become actions of last resort or programmer desire.
>>> >>
>>> >> Lazy become.
>>> >>
>>> >> As described earlier the basic idea behind lazy become is to use
>>> corpses (forwarding objects) that are followed lazily during GC and inline
>>> cache miss.  However, a lazy scheme cannot be used on objects with named
>>> inst vars without adding checking to all inst var accesses, which we judge
>>> too expensive.  Instead, when becomming objects with named inst vars, we
>>> scan all activations in the stack zone, eagerly becomming these references,
>>> and we check for corpses when faulting in a context into the stack zone.
>>>  Essentially, the invariant is that there are no references to corpses from
>>> the receiver slots of stack activations.  A detail is whether we allow or
>>> forbid pinning of closure indirection vectors, or scan the entire stack of
>>> each activation.  Using a special class index pun for indirection vectors
>>> is a cheap way of preventing their becomming/pinning etc.  Although "don't
>>> do that" (don't attempt to pin/become indirection vectors) is also an
>>> acceptable response.
>>> >>
>>> >> 60-bit immediate Floats
>>> >> Representation for immediate doubles, only used in the 64-bit
>>> implementation. Immediate doubles have the same 52 bit mantissa as IEEE
>>> double-precision  floating-point, but only have 8 bits of exponent.  So
>>> they occupy just less than the middle 1/8th of the double range.  They
>>> overlap the normal single-precision floats which also have 8 bit exponents,
>>> but exclude the single-precision denormals (exponent-127) and the
>>> single-precsion NaNs (exponent +127).  +/- zero is just a pair of values
>>> with both exponent and mantissa 0.
>>> >> So the non-zero immediate doubles range from
>>> >>         +/- 0x3800,0000,0000,0001 / 5.8774717541114d-39
>>> >> to      +/- 0x47ff,ffff,ffff,ffff / 6.8056473384188d+38
>>> >> The encoded tagged form has the sign bit moved to the least
>>> significant bit, which allows for faster encode/decode because offsetting
>>> the exponent can't overflow into the sign bit and because testing for +/- 0
>>> is an unsigned compare for <= 0xf:
>>> >>     msb
>>>                          lsb
>>> >>     [8 exponent subset bits][52 mantissa bits ][1 sign bit][3 tag
>>> bits]
>>> >> So assuming the tag is 5, the tagged non-zero bit patterns are
>>> >>              0x0000,0000,0000,001[d/5]
>>> >> to           0xffff,ffff,ffff,fff[d/5]
>>> >> and +/- 0d is 0x0000,0000,0000,000[d/5]
>>> >> Encode/decode of non-zero values in machine code looks like:
>>> >>                                              msb
>>>                          lsb
>>> >> Decode:
>>>  [8expsubset][52mantissa][1s][3tags]
>>> >> shift away tags:                     [ 000
>>> ][8expsubset][52mantissa][1s]
>>> >> add exponent offset: [     11 exponent     ][52mantissa][1s]
>>> >> rot sign:                            [1s][     11 exponent
>>> ][52mantissa]
>>> >>
>>> >> Encode:                                      [1s][     11 exponent
>>>   ][52mantissa]
>>> >> rot sign:                            [     11 exponent
>>> ][52mantissa][1s]
>>> >> sub exponent offset: [ 000 ][8expsubset][52 mantissa][1s]
>>> >> shift:                                       [8expsubset][52
>>> mantissa][1s][ 000 ]
>>> >> or/add tags:                 [8expsubset][52mantissa][1s][3tags]
>>> >> but is slower in C because
>>> >> a) there is no rotate, and
>>> >> b) raw conversion between double and quadword must (at least in the
>>> source) move bits through memory ( quadword = *(q64 *)&doubleVariable).
>>> >>
>>> >> Issues:
>>> >> How do we avoid the Size4Bit for 64-bits?  The format word encodes
>>> the number of odd bytes, but currently has only 4 bits and hence only
>>> supports odd bytes of 0 - 3.  We need odd bytes of 0 - 7.  But I don't like
>>> the separate Size4Bit.  Best to change the VI code and have a 5 bit format?
>>>  We loose one bit but save two bits (isEphemeron and isWeak (or three, if
>>> isPointers)) for a net gain of one (or two)
>>> >>
>>> >> Further, keep Squeak's format idea or go for separate bits?  For
>>> 64-bits we need a 5 bit format field.  This contrasts with isPointers,
>>> isWeak, isEphemeron, 3 odd size bits (or byte size)..  format field is
>>> quite economical.
>>> >>
>>> >> Are class indices in inline caches strong references to classes or
>>> weak references?
>>> >> If strong then they must be scanned during GC and the methodZone must
>>> be flushed on fullGC to reclaim all classes (this looks to be a bug in the
>>> V3 Cogit).
>>> >> If weak then when the class table loses references, PICs containing
>>> freed classes must be freed and then sends to freed PICs or containing
>>> freed clases must be unlinked.
>>> >> The second approach is faster; the common case is scanning the class
>>> table, the uncommon case is freeing classes.  The second approach is
>>> better; in-line caches do not prevent reclamation of classes.
>>> >>
>>> >>
>>> >> Stef
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> best,
>>> >> Eliot
>>> >>
>>> >
>>>
>>>
>>>
>>
>>
>> --
>> best,
>> Eliot
>>
>>
>>
>
>
> --
> Mariano
> http://marianopeck.wordpress.com
>
>
>

-- 
best,
Eliot
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.squeakfoundation.org/pipermail/vm-dev/attachments/20120530/862e475e/attachment-0001.htm