[Vm-dev] Re: [Pharo-project] Plan/discussion/communication around new object format

Wed May 30 20:53:04 UTC 2012

On Wed, May 30, 2012 at 10:22 PM, Eliot Miranda <eliot.miranda at gmail.com>wrote:

>
>
>
> On Wed, May 30, 2012 at 12:59 PM, Stéphane Ducasse <
> stephane.ducasse at inria.fr> wrote:
>
>> I would like to be sure that we can have
>>        - bit for immutable objects
>>        - bits for experimenting.
>>
>
> There will be quite a few.  And one will be able to steal bits from the
> class field if one needs fewer classes.  I'm not absolutely sure of the
> layout yet.  But for example
>
> 8: slot size (255 => extra header word with large size)
> 3: odd bytes/fixed fields (odd bytes for non-pointer, fixed fields for
> pointer, 7 => # fixed fields is in the class)
> 4 bits: format (pointers, indexable, bytes/shorts/longs/doubles indexable,
> compiled method, ephemerons, weak, etc)
> 1: immutability
> 3: GC 2 mark bits. 1 forwarded bit
> 20: identity hash
>

and we can make it lazy, that is, we compute it not at instantiation time
but rather the first time the primitive is call.

> 20: class index
>

This would probably work for a while. I think that it would be good to let
an "open door" so that in the future we can just add one more word for a
class pointer.

>
> still leaves 5 bits unused.  And stealing 4 bits each from class index
> still leaves 64k classes.  So this format is simple and provides lots of
> unused bits.  The format field is a great idea as it combines a number of
> orthogonal properties in very few bits.  I don't want to include odd bytes
> in format because I think a separate field that holds odd bytes and fixed
> fields is better use of space.  But we can gather statistics before
> deciding.
>
>
>> Stef
>>
>> On May 30, 2012, at 8:48 AM, Stéphane Ducasse wrote:
>>
>> > Hi guys
>> >
>> > Here is an important topic I would like to see discussed so that we see
>> how we can improve and join forces.
>> > May a mail discussion then a wiki for the summary would be good.
>> >
>> >
>> > stef
>> >
>> >
>> >
>> > Begin forwarded message:
>> >
>> >> From: Eliot Miranda <eliot.miranda at gmail.com>
>> >> Subject: Re: Plan/discussion/communication around new object format
>> >> Date: May 27, 2012 10:49:54 PM GMT+02:00
>> >> To: Stéphane Ducasse <stephane.ducasse at inria.fr>
>> >>
>> >>
>> >>
>> >> On Sat, May 26, 2012 at 1:46 AM, Stéphane Ducasse <
>> stephane.ducasse at inria.fr> wrote:
>> >> Hi eliot
>> >>
>> >> do you have a description of the new object format you want to
>> introduce?
>> >>
>> >> The design is in the class comment of CogMemoryManager in the Cog
>> VMMaker packages.
>> >>
>> >> Then what is your schedule?
>> >>
>> >> This is difficult. I have made a small start and should be able to
>> spend time on it starting soon.  I want to have it finished by early next
>> year.  But it depends on work schedules etc.
>> >>
>> >>
>> >> I would like to see if we can allocate igor/esteban time before we run
>> out of money
>> >> to help on that important topic.
>> >> Now the solution is unclear and I did not see any document where we
>> can evaluate
>> >> and plan how we can help. So do you want help on that topic? Then how
>> can people
>> >> contribute if any?
>> >>
>> >> The first thing to do is to read the design document, to see if the
>> Pharo community thinks it is the right direction, and to review it, spot
>> deficiencies etc.  So please get those interested to read the class comment
>> of CogMemoryManager in the latest VMMaker.oscog.
>> >>
>> >> Here's the current version of it:
>> >>
>> >> CogMemoryManager is currently a place-holder for the design of the new
>> Cog VM's object representation and garbage collector.  The goals for the GC
>> are
>> >>
>> >> - efficient object representation a la Eliot Miranda's VisualWorks
>> 64-bit object representation that uses a 64-bit header, eliminating direct
>> class references so that all objects refer to their classes indirectly.
>>  Instead the header contains a constant class index, in a field smaller
>> than a full pointer, These class indices are used in inline and first-level
>> method caches, hence they do not have to be updated on GC (although they do
>> have to be traced to be able to GC classes).  Classes are held in a sparse
>> weak table.  The class table needs only to be indexed by an instance's
>> class index in class hierarchy search, in the class primitive, and in
>> tracing live objects in the heap.  The additional header space is allocated
>> to a much expanded identity hash field, reducing hash efficiency problems
>> in identity collections due to the extremely small (11 bit) hash field in
>> the old Squeak GC.  The identity hash field is also a key element of the
>> class index scheme.  A class's identity hash is its index into the class
>> table, so to create an instance of a class one merely copies its identity
>> hash into the class index field of the new instance.  This implies that
>> when classes gain their identity hash they are entered into the class table
>> and their identity hash is that of a previously unused index in the table.
>>  It also implies that there is a maximum number of classes in the table.
>>  At least for a few years 64k classes should be enough.  A class is entered
>> into the class table in the following operations:
>> >>      behaviorHash
>> >>      adoptInstance
>> >>      instantiate
>> >>      become  (i.e. if an old class becomes a new class)
>> >>              if target class field's = to original's id hash
>> >>                 and replacement's id hash is zero
>> >>                      enter replacement in class table
>> >> behaviorHash is a special version of identityHash that must be
>> implemented in the image by any object that can function as a class (i.e.
>> Behavior).
>> >>
>> >> - more immediate classes.  An immediate Character class would speed up
>> String accessing, especially for WideString, since no instatiation needs to
>> be done on at:put: and no dereference need be done on at:.  In a 32-bit
>> system tag checking is complex since it is thought important to retain
>> 31-bit SmallIntegers.  Hence, as in current Squeak, the least significant
>> bit set implies a SmallInteger, but Characters would likely have a tag
>> pattern of xxx10.  Hence masking with 11 results in two values for
>> SmallInteger, xxx01 and xxx11.  30-bit characters are more than adequate
>> for Unicode.  In a 64-bit system we can use the full three bits and
>> usefully implement an immediate Float.  As in VisualWorks a functional
>> representation takes three bits away from the exponent.  Rotating to put
>> the sign bit in the least significant non-tag bit makes expanding and
>> contracting the 8-bit exponent to the 11-bit IEEE double exponent easy ad
>> makes comparing negative and positive zero easier (an immediate Float is
>> zero if its unsigned 64-bits are < 16).  So the representation looks like
>> >>      | 8 bit exponent | 52 bit mantissa | sign bit | 3 tag bits |
>> >> For details see "60-bit immediate Floats" below.
>> >>
>> >>
>> >> - efficient scavenging.  The current Squeak GC uses a slow
>> pointer-reversal collector that writes every field in live objects three
>> times in each collection, twice in the pointer-reversing heap traversal to
>> mark live objects and once to update the pointer to its new location.  A
>> scavenger writes every field of live data twice in each collection, once as
>> it does a block copy of the object when copying to to space, once as it
>> traverses the live pointers in the to space objects.  Of course the block
>> copy is a relatively cheap write.
>> >>
>> >> - lazy become.  The JIT's use of inline cacheing provides a cheap way
>> of avoiding scanning the heap as part of a become (which is the simple
>> approach to implementing become in a system with direct pointers).  A
>> becomeForward: on a (set of) non-zero-sized object(s) turns the object into
>> a "corpse" or "forwarding object" whose first (non-header) word/slot is
>> replaced by a pointer to the target of the becomeForward:.  The corpse's
>> class index is set to one that identifies corpses and, because it is a
>> hidden class index, will always fail an inline cache test.  The inline
>> cache failure code is then responsible for following the forwarding pointer
>> chain (these are Iliffe vectors :) ) and resolving to the actual target.
>>  We have yet to determine exactly how this is done (e.g. change the
>> receiver register and/or stack contents and retry the send, perhaps
>> scanning the current activation).  See below on how we deal with becomes on
>> objects with named inst vars.  Note that we probably don't have to worry
>> about zero-sized objects.  These are unlikely to be passed through the FFI
>> (there is nothing to pass :) ) and so will rarely be becommed.  If they do,
>> they can become slowly.  Alternatively we can insist that objects are at
>> least 16 bytes in size (see a8-byte alignment below) so that there will
>> always be space for a forwarding pointer.  Since none of the immediate
>> classes can have non-immediate instances and since we allocate the
>> immediate classes indices corresponding to their tag pattern (SmallInteger
>> = 1, Character = 3, SmallFloat = 4?) we can use all the class indices from
>> 0 to 7 for special uses, 0 = forward, and e.g. 1 = header-sized filler.
>> >>
>> >> - pinning.  To support a robust and easy-to-use FFI the memory manager
>> must support temporary pinning where individual objects can be prevented
>> from being moved by the GC for as long as required, either by being one of
>> an in-progress FFI call's arguments, or by having pinning asserted by a
>> primitive (allowing objects to be passed to external code that retains a
>> reference to the object after returning).  Pinning probably implies a
>> per-object "is-pinned" bit in the object header.  Pinning will be done via
>> lazy become; i..e an object in new space will be becommed into a pinned
>> object in old space.  We will only support pinning in old space.
>> >>
>> >> - efficient old space collection.  An incremental collector (a la
>> Dijkstra's three colour algorithm) collects old space, e.g. via an amount
>> of tracing being hung off scavenges and/or old space allocations at an
>> adaptive rate that keeps full garbage collections to a minimum.  (see free
>> space/free list below)
>> >>
>> >> - 8-byte alignment.  It is advantageous for the FFI, for
>> floating-point access, for object movement and for 32/64-bit
>> compatibility to keep object sizes in units of 8 bytes.  For the FFI,
>> 8-byte alignment means passing objects to code that expects that
>> requirement (such as modern x86 numeric processing instructions).  This
>> implies that
>> >>      - the starts of all spaces are aligned on 8-byte boundaries
>> >>      - object allocation rounds up the requested size to a multiple of
>> 8 bytes
>> >>      - the overflow size field is also 8 bytes
>> >> We shall probably keep the minimum object size at 16 bytes so that
>> there is always room for a forwarding pointer.  But this implies that we
>> will need to implement an 8-byte filler to fill holes between objects > 16
>> bytes whose length mod 16 bytes is 8 bytes and following pinned objects.
>>  We can do this using a special class index, e.g. 1, so that the method
>> that answers the size of an object looks like, e.g.
>> >>      chunkSizeOf: oop
>> >>              <var: #oop type: #'object *'>
>> >>              ^object classIndex = 1
>> >>                      ifTrue: [BaseHeaderSize]
>> >>                      ifFalse: [BaseHeaderSize
>> >>                                + (object slotSize = OverflowSlotSize
>> >>                                              ifTrue:
>> [OverflowSizeBytes]
>> >>                                              ifFalse: [0])
>> >>                                + (object slotSize * BytesPerSlot)]
>> >>
>> >>      chunkStartOf: oop
>> >>              <var: #oop type: #'object *'>
>> >>              ^(self cCoerceSimple: oop to: #'char *')
>> >>                      - ((object classIndex = 1
>> >>                          or: [object slotSize ~= OverflowSlotSize])
>> >>                                      ifTrue: [0]
>> >>                                      ifFalse: [OverflowSizeBytes])
>> >>
>> >> For the moment we do not tackle the issue of heap growth and shrinkage
>> with the ability to allocate and deallocate heap segments via
>> memory-mapping.  This technique allows space to be released back to the OS
>> by unmapping empty segments.  We may revisit this but it is not a key
>> requirement for the first implementation.
>> >>
>> >> The basic approach is to use a fixed size new space and a growable old
>> space.  The new space is a classic three-space nursery a la Ungar's
>> Generation Scavenging, a large eden for new objects and two smaller
>> survivor spaces that exchange roles on each collection, one being the to
>> space to which surviving objects are copied, the other being the from space
>> of the survivors of the previous collection, i.e. the previous to space.
>> >>
>> >> To provide apparent pinning in new space we rely on lazy become.
>>  Since most pinned objects will be byte data and these do not require stack
>> zone activation scanning, the overhead is simply an old space allocation
>> and corpsing.
>> >>
>> >> To provide pinning in old space, large objects are implicitly pinned
>> (because it is expensive to move large objects and, because they are both
>> large and relatively rare, they contribute little to overall fragmentation
>> - as in aggregates, small objects can be used to fill-in the spaces between
>> karge objects).  Hence, objects above a particular size are automatically
>> allocated in old space, rather than new space.  Small objects are pinned as
>> per objects in new space, by asserting the pin bit, which will be set
>> automaticaly when allocating a large object.  As a last resort, or by
>> programmer control (the fullGC primitive) old space is collected via
>> mark-sweep (mark-compact) and so the mark phase must build the list of
>> pinned objects around which the sweep/compact phase must carefully step.
>> >>
>> >> Free space in old space is organized by a free list/free tree as in
>> Eliot's VisualWorks 5i old space allocator.  There are 64 free lists,
>> indices 1 through 63 holding blocks of space of that size, index 0 holding
>> a semi-balanced ordered tree of free blocks, each node being the head of
>> the list of free blocks of that size.  At the start of the mark phase the
>> free list is thrown away and the sweep phase coallesces free space and
>> steps over pinned objects as it proceeds.  We can reuse the forwarding
>> pointer compaction scheme used in the old collector.  Incremental
>> collections merely move unmarked objects to the free lists (as well as
>> nilling weak pointers in weak arrays and scheduling them for finalization).
>>  The occupancy of the free lists is represented by a bitmap in a 64-bit
>> integer so that an allocation of size 63 or less can know whether there
>> exists a free chunk of that size, but more importantly can know whether a
>> free chunk larger than it exists in the fixed size free lists without
>> having to search all larger free list heads.
>> >>
>> >> The incremental collector (a la Dijkstra's three colour algorithm)
>> collects old space via an amount of tracing being hung off scavenges and/or
>> old space allocations at an adaptive rate that keeps full garbage
>> collections to a minimum.  [N.B. Not sure how to do this yet.  The
>> incremental collector needs to complete a pass often enough to reclaim
>> objects, but infrequent enough not to waste time.  So some form of feedback
>> should work.  In VisualWorks tracing is broken into quanta or work where
>> image-level code determines the size of a quantum based on how fast the
>> machine is, and how big the heap is.  This code could easily live in the
>> VM, controllable through vmParameterAt:put:.  An alternative would be to
>> use the heartbeat to bound quanta by time.  But in any case some amount of
>> incremental collection would be done on old space allocation and
>> scavenging, the ammount being chosen to keep pause times acceptably short,
>> and at a rate to reclaim old space before a full GC is required, i.e. at a
>> rate proportional to the growth in old space]. The incemental collector is
>> a state machine, being either marking, nilling weak pointers, or freeing.
>>  If nilling weak pointers is not done atomically then there must be a read
>> barrier in weak array at: so that reading from an old space weak array that
>> is holding stale un-nilled references to unmarked objects.  Tricks such as
>> including the weak bit in bounds calculations can make this cheap for
>> non-weak arrays.  Alternatively nilling weak pointers can be made an atomic
>> part of incremental collection, which can be made cheaper by maintaining
>> the set of weak arrays (e.g. on a list).
>> >>
>> >> The incremental collector implies a more complex write barrier.
>>  Objects are of three colours, black, having been scanned, grey, being
>> scanned, and white, unreached.  A mark stack holds the grey objects.   If
>> the incremental collector is marking and an unmarked white object is stored
>> into a black object then the stored object must become grey, being added to
>> the mark stack.  So the wrte barrier is essentially
>> >>      target isYoung ifFalse:
>> >>              [newValue isYoung
>> >>                      ifTrue: [target isInRememberedSet ifFalse:
>> >>                                      [target addToRememberedSet]]
>> "target now refers to a young object; it is a root for scavenges"
>> >>                      ifFalse:
>> >>                              [(target isBlack
>> >>                                and: [igc marking
>> >>                                and: [newValue isWhite]]) ifTrue:
>> >>                                      [newValue beGrey]]] "add newValue
>> to IGC's markStack for subsequent scanning"
>> >>
>> >> The incremental collector does not detect already marked objects all
>> of whose references have been overwritten by other stores (e.g. in the
>> above if newValue overwrites the sole remaining reference to a marked
>> object).  So the incremental collector only guarantees to collect all
>> garbage created in cycle N at the end of cycle N + 1.  The cost is hence
>> slightly worse memory density but the benefit, provided the IGC works hard
>> enough, is the elimination of long pauses due to full garbage collections,
>> which become actions of last resort or programmer desire.
>> >>
>> >> Lazy become.
>> >>
>> >> As described earlier the basic idea behind lazy become is to use
>> corpses (forwarding objects) that are followed lazily during GC and inline
>> cache miss.  However, a lazy scheme cannot be used on objects with named
>> inst vars without adding checking to all inst var accesses, which we judge
>> too expensive.  Instead, when becomming objects with named inst vars, we
>> scan all activations in the stack zone, eagerly becomming these references,
>> and we check for corpses when faulting in a context into the stack zone.
>>  Essentially, the invariant is that there are no references to corpses from
>> the receiver slots of stack activations.  A detail is whether we allow or
>> forbid pinning of closure indirection vectors, or scan the entire stack of
>> each activation.  Using a special class index pun for indirection vectors
>> is a cheap way of preventing their becomming/pinning etc.  Although "don't
>> do that" (don't attempt to pin/become indirection vectors) is also an
>> acceptable response.
>> >>
>> >> 60-bit immediate Floats
>> >> Representation for immediate doubles, only used in the 64-bit
>> implementation. Immediate doubles have the same 52 bit mantissa as IEEE
>> double-precision  floating-point, but only have 8 bits of exponent.  So
>> they occupy just less than the middle 1/8th of the double range.  They
>> overlap the normal single-precision floats which also have 8 bit exponents,
>> but exclude the single-precision denormals (exponent-127) and the
>> single-precsion NaNs (exponent +127).  +/- zero is just a pair of values
>> with both exponent and mantissa 0.
>> >> So the non-zero immediate doubles range from
>> >>         +/- 0x3800,0000,0000,0001 / 5.8774717541114d-39
>> >> to      +/- 0x47ff,ffff,ffff,ffff / 6.8056473384188d+38
>> >> The encoded tagged form has the sign bit moved to the least
>> significant bit, which allows for faster encode/decode because offsetting
>> the exponent can't overflow into the sign bit and because testing for +/- 0
>> is an unsigned compare for <= 0xf:
>> >>     msb
>>                          lsb
>> >>     [8 exponent subset bits][52 mantissa bits ][1 sign bit][3 tag bits]
>> >> So assuming the tag is 5, the tagged non-zero bit patterns are
>> >>              0x0000,0000,0000,001[d/5]
>> >> to           0xffff,ffff,ffff,fff[d/5]
>> >> and +/- 0d is 0x0000,0000,0000,000[d/5]
>> >> Encode/decode of non-zero values in machine code looks like:
>> >>                                              msb
>>                        lsb
>> >> Decode:
>>  [8expsubset][52mantissa][1s][3tags]
>> >> shift away tags:                     [ 000
>> ][8expsubset][52mantissa][1s]
>> >> add exponent offset: [     11 exponent     ][52mantissa][1s]
>> >> rot sign:                            [1s][     11 exponent
>> ][52mantissa]
>> >>
>> >> Encode:                                      [1s][     11 exponent
>> ][52mantissa]
>> >> rot sign:                            [     11 exponent
>> ][52mantissa][1s]
>> >> sub exponent offset: [ 000 ][8expsubset][52 mantissa][1s]
>> >> shift:                                       [8expsubset][52
>> mantissa][1s][ 000 ]
>> >> or/add tags:                 [8expsubset][52mantissa][1s][3tags]
>> >> but is slower in C because
>> >> a) there is no rotate, and
>> >> b) raw conversion between double and quadword must (at least in the
>> source) move bits through memory ( quadword = *(q64 *)&doubleVariable).
>> >>
>> >> Issues:
>> >> How do we avoid the Size4Bit for 64-bits?  The format word encodes the
>> number of odd bytes, but currently has only 4 bits and hence only supports
>> odd bytes of 0 - 3.  We need odd bytes of 0 - 7.  But I don't like the
>> separate Size4Bit.  Best to change the VI code and have a 5 bit format?  We
>> loose one bit but save two bits (isEphemeron and isWeak (or three, if
>> isPointers)) for a net gain of one (or two)
>> >>
>> >> Further, keep Squeak's format idea or go for separate bits?  For
>> 64-bits we need a 5 bit format field.  This contrasts with isPointers,
>> isWeak, isEphemeron, 3 odd size bits (or byte size)..  format field is
>> quite economical.
>> >>
>> >> Are class indices in inline caches strong references to classes or
>> weak references?
>> >> If strong then they must be scanned during GC and the methodZone must
>> be flushed on fullGC to reclaim all classes (this looks to be a bug in the
>> V3 Cogit).
>> >> If weak then when the class table loses references, PICs containing
>> freed classes must be freed and then sends to freed PICs or containing
>> freed clases must be unlinked.
>> >> The second approach is faster; the common case is scanning the class
>> table, the uncommon case is freeing classes.  The second approach is
>> better; in-line caches do not prevent reclamation of classes.
>> >>
>> >>
>> >> Stef
>> >>
>> >>
>> >>
>> >> --
>> >> best,
>> >> Eliot
>> >>
>> >
>>
>>
>>
>
>
> --
> best,
> Eliot
>
>
>

-- 
Mariano
http://marianopeck.wordpress.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.squeakfoundation.org/pipermail/vm-dev/attachments/20120530/63bd3441/attachment-0001.htm