[Vm-dev] Re: [Pharo-project] Plan/discussion/communication around new object format

Stefan Marr smalltalk at stefan-marr.de
Wed May 30 20:57:49 UTC 2012


Hi Eliot:

From my experience with the RoarVM, it seems to be a rather simple exercise to enable the VM to support a custom 'pre-header' for objects.
That is, a constant offset in the memory that comes from the allocator, and is normally ignored by the GC.

That allows me to do all kind of stuff. Of course at a cost of a word per object, and at the cost of recompiling the VM.
But, that should be a reasonable price to pay for someone doing research on these kind of things.

Sometimes a few bits are just not enough, and such a pre-header gives much much more flexibility.
For the people interested in that, I could dig out the details (I think, I did that already once on this list).

Best regards
Stefan


On 30 May 2012, at 22:22, Eliot Miranda wrote:

> 
> 
> On Wed, May 30, 2012 at 12:59 PM, Stéphane Ducasse <stephane.ducasse at inria.fr> wrote:
> I would like to be sure that we can have
>        - bit for immutable objects
>        - bits for experimenting.
> 
> There will be quite a few.  And one will be able to steal bits from the class field if one needs fewer classes.  I'm not absolutely sure of the layout yet.  But for example
> 
> 8: slot size (255 => extra header word with large size)
> 3: odd bytes/fixed fields (odd bytes for non-pointer, fixed fields for pointer, 7 => # fixed fields is in the class)
> 4 bits: format (pointers, indexable, bytes/shorts/longs/doubles indexable, compiled method, ephemerons, weak, etc)
> 1: immutability
> 3: GC 2 mark bits. 1 forwarded bit
> 20: identity hash
> 20: class index
> 
> still leaves 5 bits unused.  And stealing 4 bits each from class index still leaves 64k classes.  So this format is simple and provides lots of unused bits.  The format field is a great idea as it combines a number of orthogonal properties in very few bits.  I don't want to include odd bytes in format because I think a separate field that holds odd bytes and fixed fields is better use of space.  But we can gather statistics before deciding.
> 
> 
> Stef
> 
> On May 30, 2012, at 8:48 AM, Stéphane Ducasse wrote:
> 
> > Hi guys
> >
> > Here is an important topic I would like to see discussed so that we see how we can improve and join forces.
> > May a mail discussion then a wiki for the summary would be good.
> >
> >
> > stef
> >
> >
> >
> > Begin forwarded message:
> >
> >> From: Eliot Miranda <eliot.miranda at gmail.com>
> >> Subject: Re: Plan/discussion/communication around new object format
> >> Date: May 27, 2012 10:49:54 PM GMT+02:00
> >> To: Stéphane Ducasse <stephane.ducasse at inria.fr>
> >>
> >>
> >>
> >> On Sat, May 26, 2012 at 1:46 AM, Stéphane Ducasse <stephane.ducasse at inria.fr> wrote:
> >> Hi eliot
> >>
> >> do you have a description of the new object format you want to introduce?
> >>
> >> The design is in the class comment of CogMemoryManager in the Cog VMMaker packages.
> >>
> >> Then what is your schedule?
> >>
> >> This is difficult. I have made a small start and should be able to spend time on it starting soon.  I want to have it finished by early next year.  But it depends on work schedules etc.
> >>
> >>
> >> I would like to see if we can allocate igor/esteban time before we run out of money
> >> to help on that important topic.
> >> Now the solution is unclear and I did not see any document where we can evaluate
> >> and plan how we can help. So do you want help on that topic? Then how can people
> >> contribute if any?
> >>
> >> The first thing to do is to read the design document, to see if the Pharo community thinks it is the right direction, and to review it, spot deficiencies etc.  So please get those interested to read the class comment of CogMemoryManager in the latest VMMaker.oscog.
> >>
> >> Here's the current version of it:
> >>
> >> CogMemoryManager is currently a place-holder for the design of the new Cog VM's object representation and garbage collector.  The goals for the GC are
> >>
> >> - efficient object representation a la Eliot Miranda's VisualWorks 64-bit object representation that uses a 64-bit header, eliminating direct class references so that all objects refer to their classes indirectly.  Instead the header contains a constant class index, in a field smaller than a full pointer, These class indices are used in inline and first-level method caches, hence they do not have to be updated on GC (although they do have to be traced to be able to GC classes).  Classes are held in a sparse weak table.  The class table needs only to be indexed by an instance's class index in class hierarchy search, in the class primitive, and in tracing live objects in the heap.  The additional header space is allocated to a much expanded identity hash field, reducing hash efficiency problems in identity collections due to the extremely small (11 bit) hash field in the old Squeak GC.  The identity hash field is also a key element of the class index scheme.  A class's identity hash is its index into the class table, so to create an instance of a class one merely copies its identity hash into the class index field of the new instance.  This implies that when classes gain their identity hash they are entered into the class table and their identity hash is that of a previously unused index in the table.  It also implies that there is a maximum number of classes in the table.  At least for a few years 64k classes should be enough.  A class is entered into the class table in the following operations:
> >>      behaviorHash
> >>      adoptInstance
> >>      instantiate
> >>      become  (i.e. if an old class becomes a new class)
> >>              if target class field's = to original's id hash
> >>                 and replacement's id hash is zero
> >>                      enter replacement in class table
> >> behaviorHash is a special version of identityHash that must be implemented in the image by any object that can function as a class (i.e. Behavior).
> >>
> >> - more immediate classes.  An immediate Character class would speed up String accessing, especially for WideString, since no instatiation needs to be done on at:put: and no dereference need be done on at:.  In a 32-bit system tag checking is complex since it is thought important to retain 31-bit SmallIntegers.  Hence, as in current Squeak, the least significant bit set implies a SmallInteger, but Characters would likely have a tag pattern of xxx10.  Hence masking with 11 results in two values for SmallInteger, xxx01 and xxx11.  30-bit characters are more than adequate for Unicode.  In a 64-bit system we can use the full three bits and usefully implement an immediate Float.  As in VisualWorks a functional representation takes three bits away from the exponent.  Rotating to put the sign bit in the least significant non-tag bit makes expanding and contracting the 8-bit exponent to the 11-bit IEEE double exponent easy ad makes comparing negative and positive zero easier (an immediate Float is zero if its unsigned 64-bits are < 16).  So the representation looks like
> >>      | 8 bit exponent | 52 bit mantissa | sign bit | 3 tag bits |
> >> For details see "60-bit immediate Floats" below.
> >>
> >>
> >> - efficient scavenging.  The current Squeak GC uses a slow pointer-reversal collector that writes every field in live objects three times in each collection, twice in the pointer-reversing heap traversal to mark live objects and once to update the pointer to its new location.  A scavenger writes every field of live data twice in each collection, once as it does a block copy of the object when copying to to space, once as it traverses the live pointers in the to space objects.  Of course the block copy is a relatively cheap write.
> >>
> >> - lazy become.  The JIT's use of inline cacheing provides a cheap way of avoiding scanning the heap as part of a become (which is the simple approach to implementing become in a system with direct pointers).  A becomeForward: on a (set of) non-zero-sized object(s) turns the object into a "corpse" or "forwarding object" whose first (non-header) word/slot is replaced by a pointer to the target of the becomeForward:.  The corpse's class index is set to one that identifies corpses and, because it is a hidden class index, will always fail an inline cache test.  The inline cache failure code is then responsible for following the forwarding pointer chain (these are Iliffe vectors :) ) and resolving to the actual target.  We have yet to determine exactly how this is done (e.g. change the receiver register and/or stack contents and retry the send, perhaps scanning the current activation).  See below on how we deal with becomes on objects with named inst vars.  Note that we probably don't have to worry about zero-sized objects.  These are unlikely to be passed through the FFI (there is nothing to pass :) ) and so will rarely be becommed.  If they do, they can become slowly.  Alternatively we can insist that objects are at least 16 bytes in size (see a8-byte alignment below) so that there will always be space for a forwarding pointer.  Since none of the immediate classes can have non-immediate instances and since we allocate the immediate classes indices corresponding to their tag pattern (SmallInteger = 1, Character = 3, SmallFloat = 4?) we can use all the class indices from 0 to 7 for special uses, 0 = forward, and e.g. 1 = header-sized filler.
> >>
> >> - pinning.  To support a robust and easy-to-use FFI the memory manager must support temporary pinning where individual objects can be prevented from being moved by the GC for as long as required, either by being one of an in-progress FFI call's arguments, or by having pinning asserted by a primitive (allowing objects to be passed to external code that retains a reference to the object after returning).  Pinning probably implies a per-object "is-pinned" bit in the object header.  Pinning will be done via lazy become; i..e an object in new space will be becommed into a pinned object in old space.  We will only support pinning in old space.
> >>
> >> - efficient old space collection.  An incremental collector (a la Dijkstra's three colour algorithm) collects old space, e.g. via an amount of tracing being hung off scavenges and/or old space allocations at an adaptive rate that keeps full garbage collections to a minimum.  (see free space/free list below)
> >>
> >> - 8-byte alignment.  It is advantageous for the FFI, for floating-point access, for object movement and for 32/64-bit compatibility to keep object sizes in units of 8 bytes.  For the FFI, 8-byte alignment means passing objects to code that expects that requirement (such as modern x86 numeric processing instructions).  This implies that
> >>      - the starts of all spaces are aligned on 8-byte boundaries
> >>      - object allocation rounds up the requested size to a multiple of 8 bytes
> >>      - the overflow size field is also 8 bytes
> >> We shall probably keep the minimum object size at 16 bytes so that there is always room for a forwarding pointer.  But this implies that we will need to implement an 8-byte filler to fill holes between objects > 16 bytes whose length mod 16 bytes is 8 bytes and following pinned objects.  We can do this using a special class index, e.g. 1, so that the method that answers the size of an object looks like, e.g.
> >>      chunkSizeOf: oop
> >>              <var: #oop type: #'object *'>
> >>              ^object classIndex = 1
> >>                      ifTrue: [BaseHeaderSize]
> >>                      ifFalse: [BaseHeaderSize
> >>                                + (object slotSize = OverflowSlotSize
> >>                                              ifTrue: [OverflowSizeBytes]
> >>                                              ifFalse: [0])
> >>                                + (object slotSize * BytesPerSlot)]
> >>
> >>      chunkStartOf: oop
> >>              <var: #oop type: #'object *'>
> >>              ^(self cCoerceSimple: oop to: #'char *')
> >>                      - ((object classIndex = 1
> >>                          or: [object slotSize ~= OverflowSlotSize])
> >>                                      ifTrue: [0]
> >>                                      ifFalse: [OverflowSizeBytes])
> >>
> >> For the moment we do not tackle the issue of heap growth and shrinkage with the ability to allocate and deallocate heap segments via memory-mapping.  This technique allows space to be released back to the OS by unmapping empty segments.  We may revisit this but it is not a key requirement for the first implementation.
> >>
> >> The basic approach is to use a fixed size new space and a growable old space.  The new space is a classic three-space nursery a la Ungar's Generation Scavenging, a large eden for new objects and two smaller survivor spaces that exchange roles on each collection, one being the to space to which surviving objects are copied, the other being the from space of the survivors of the previous collection, i.e. the previous to space.
> >>
> >> To provide apparent pinning in new space we rely on lazy become.  Since most pinned objects will be byte data and these do not require stack zone activation scanning, the overhead is simply an old space allocation and corpsing.
> >>
> >> To provide pinning in old space, large objects are implicitly pinned (because it is expensive to move large objects and, because they are both large and relatively rare, they contribute little to overall fragmentation - as in aggregates, small objects can be used to fill-in the spaces between karge objects).  Hence, objects above a particular size are automatically allocated in old space, rather than new space.  Small objects are pinned as per objects in new space, by asserting the pin bit, which will be set automaticaly when allocating a large object.  As a last resort, or by programmer control (the fullGC primitive) old space is collected via mark-sweep (mark-compact) and so the mark phase must build the list of pinned objects around which the sweep/compact phase must carefully step.
> >>
> >> Free space in old space is organized by a free list/free tree as in Eliot's VisualWorks 5i old space allocator.  There are 64 free lists, indices 1 through 63 holding blocks of space of that size, index 0 holding a semi-balanced ordered tree of free blocks, each node being the head of the list of free blocks of that size.  At the start of the mark phase the free list is thrown away and the sweep phase coallesces free space and steps over pinned objects as it proceeds.  We can reuse the forwarding pointer compaction scheme used in the old collector.  Incremental collections merely move unmarked objects to the free lists (as well as nilling weak pointers in weak arrays and scheduling them for finalization).  The occupancy of the free lists is represented by a bitmap in a 64-bit integer so that an allocation of size 63 or less can know whether there exists a free chunk of that size, but more importantly can know whether a free chunk larger than it exists in the fixed size free lists without having to search all larger free list heads.
> >>
> >> The incremental collector (a la Dijkstra's three colour algorithm) collects old space via an amount of tracing being hung off scavenges and/or old space allocations at an adaptive rate that keeps full garbage collections to a minimum.  [N.B. Not sure how to do this yet.  The incremental collector needs to complete a pass often enough to reclaim objects, but infrequent enough not to waste time.  So some form of feedback should work.  In VisualWorks tracing is broken into quanta or work where image-level code determines the size of a quantum based on how fast the machine is, and how big the heap is.  This code could easily live in the VM, controllable through vmParameterAt:put:.  An alternative would be to use the heartbeat to bound quanta by time.  But in any case some amount of incremental collection would be done on old space allocation and scavenging, the ammount being chosen to keep pause times acceptably short, and at a rate to reclaim old space before a full GC is required, i.e. at a rate proportional to the growth in old space]. The incemental collector is a state machine, being either marking, nilling weak pointers, or freeing.  If nilling weak pointers is not done atomically then there must be a read barrier in weak array at: so that reading from an old space weak array that is holding stale un-nilled references to unmarked objects.  Tricks such as including the weak bit in bounds calculations can make this cheap for non-weak arrays.  Alternatively nilling weak pointers can be made an atomic part of incremental collection, which can be made cheaper by maintaining the set of weak arrays (e.g. on a list).
> >>
> >> The incremental collector implies a more complex write barrier.  Objects are of three colours, black, having been scanned, grey, being scanned, and white, unreached.  A mark stack holds the grey objects.   If the incremental collector is marking and an unmarked white object is stored into a black object then the stored object must become grey, being added to the mark stack.  So the wrte barrier is essentially
> >>      target isYoung ifFalse:
> >>              [newValue isYoung
> >>                      ifTrue: [target isInRememberedSet ifFalse:
> >>                                      [target addToRememberedSet]] "target now refers to a young object; it is a root for scavenges"
> >>                      ifFalse:
> >>                              [(target isBlack
> >>                                and: [igc marking
> >>                                and: [newValue isWhite]]) ifTrue:
> >>                                      [newValue beGrey]]] "add newValue to IGC's markStack for subsequent scanning"
> >>
> >> The incremental collector does not detect already marked objects all of whose references have been overwritten by other stores (e.g. in the above if newValue overwrites the sole remaining reference to a marked object).  So the incremental collector only guarantees to collect all garbage created in cycle N at the end of cycle N + 1.  The cost is hence slightly worse memory density but the benefit, provided the IGC works hard enough, is the elimination of long pauses due to full garbage collections, which become actions of last resort or programmer desire.
> >>
> >> Lazy become.
> >>
> >> As described earlier the basic idea behind lazy become is to use corpses (forwarding objects) that are followed lazily during GC and inline cache miss.  However, a lazy scheme cannot be used on objects with named inst vars without adding checking to all inst var accesses, which we judge too expensive.  Instead, when becomming objects with named inst vars, we scan all activations in the stack zone, eagerly becomming these references, and we check for corpses when faulting in a context into the stack zone.  Essentially, the invariant is that there are no references to corpses from the receiver slots of stack activations.  A detail is whether we allow or forbid pinning of closure indirection vectors, or scan the entire stack of each activation.  Using a special class index pun for indirection vectors is a cheap way of preventing their becomming/pinning etc.  Although "don't do that" (don't attempt to pin/become indirection vectors) is also an acceptable response.
> >>
> >> 60-bit immediate Floats
> >> Representation for immediate doubles, only used in the 64-bit implementation. Immediate doubles have the same 52 bit mantissa as IEEE double-precision  floating-point, but only have 8 bits of exponent.  So they occupy just less than the middle 1/8th of the double range.  They overlap the normal single-precision floats which also have 8 bit exponents, but exclude the single-precision denormals (exponent-127) and the single-precsion NaNs (exponent +127).  +/- zero is just a pair of values with both exponent and mantissa 0.
> >> So the non-zero immediate doubles range from
> >>         +/- 0x3800,0000,0000,0001 / 5.8774717541114d-39
> >> to      +/- 0x47ff,ffff,ffff,ffff / 6.8056473384188d+38
> >> The encoded tagged form has the sign bit moved to the least significant bit, which allows for faster encode/decode because offsetting the exponent can't overflow into the sign bit and because testing for +/- 0 is an unsigned compare for <= 0xf:
> >>     msb                                                                                        lsb
> >>     [8 exponent subset bits][52 mantissa bits ][1 sign bit][3 tag bits]
> >> So assuming the tag is 5, the tagged non-zero bit patterns are
> >>              0x0000,0000,0000,001[d/5]
> >> to           0xffff,ffff,ffff,fff[d/5]
> >> and +/- 0d is 0x0000,0000,0000,000[d/5]
> >> Encode/decode of non-zero values in machine code looks like:
> >>                                              msb                                              lsb
> >> Decode:                              [8expsubset][52mantissa][1s][3tags]
> >> shift away tags:                     [ 000 ][8expsubset][52mantissa][1s]
> >> add exponent offset: [     11 exponent     ][52mantissa][1s]
> >> rot sign:                            [1s][     11 exponent     ][52mantissa]
> >>
> >> Encode:                                      [1s][     11 exponent     ][52mantissa]
> >> rot sign:                            [     11 exponent     ][52mantissa][1s]
> >> sub exponent offset: [ 000 ][8expsubset][52 mantissa][1s]
> >> shift:                                       [8expsubset][52 mantissa][1s][ 000 ]
> >> or/add tags:                 [8expsubset][52mantissa][1s][3tags]
> >> but is slower in C because
> >> a) there is no rotate, and
> >> b) raw conversion between double and quadword must (at least in the source) move bits through memory ( quadword = *(q64 *)&doubleVariable).
> >>
> >> Issues:
> >> How do we avoid the Size4Bit for 64-bits?  The format word encodes the number of odd bytes, but currently has only 4 bits and hence only supports odd bytes of 0 - 3.  We need odd bytes of 0 - 7.  But I don't like the separate Size4Bit.  Best to change the VI code and have a 5 bit format?  We loose one bit but save two bits (isEphemeron and isWeak (or three, if isPointers)) for a net gain of one (or two)
> >>
> >> Further, keep Squeak's format idea or go for separate bits?  For 64-bits we need a 5 bit format field.  This contrasts with isPointers, isWeak, isEphemeron, 3 odd size bits (or byte size)..  format field is quite economical.
> >>
> >> Are class indices in inline caches strong references to classes or weak references?
> >> If strong then they must be scanned during GC and the methodZone must be flushed on fullGC to reclaim all classes (this looks to be a bug in the V3 Cogit).
> >> If weak then when the class table loses references, PICs containing freed classes must be freed and then sends to freed PICs or containing freed clases must be unlinked.
> >> The second approach is faster; the common case is scanning the class table, the uncommon case is freeing classes.  The second approach is better; in-line caches do not prevent reclamation of classes.
> >>
> >>
> >> Stef
> >>
> >>
> >>
> >> --
> >> best,
> >> Eliot
> >>
> >
> 
> 
> 
> 
> 
> -- 
> best,
> Eliot
> 

-- 
Stefan Marr
Software Languages Lab
Vrije Universiteit Brussel
Pleinlaan 2 / B-1050 Brussels / Belgium
http://soft.vub.ac.be/~smarr
Phone: +32 2 629 2974
Fax:   +32 2 629 3525



More information about the Vm-dev mailing list