[Vm-dev] Re: [Pharo-project] Plan/discussion/communication around
new object format
Eliot Miranda
eliot.miranda at gmail.com
Wed May 30 21:10:43 UTC 2012
On Wed, May 30, 2012 at 1:57 PM, Stefan Marr <smalltalk at stefan-marr.de>wrote:
>
> Hi Eliot:
>
> From my experience with the RoarVM, it seems to be a rather simple
> exercise to enable the VM to support a custom 'pre-header' for objects.
> That is, a constant offset in the memory that comes from the allocator,
> and is normally ignored by the GC.
>
That's a great idea. So for experimental use one simply throws a whole
word at the problem and forgets about the issue. Thanks, Stefan. That
leaves me free to focus on something fast and compact that is still
flexible. Great!
>
> That allows me to do all kind of stuff. Of course at a cost of a word per
> object, and at the cost of recompiling the VM.
> But, that should be a reasonable price to pay for someone doing research
> on these kind of things.
>
> Sometimes a few bits are just not enough, and such a pre-header gives much
> much more flexibility.
> For the people interested in that, I could dig out the details (I think, I
> did that already once on this list).
>
> Best regards
> Stefan
>
>
> On 30 May 2012, at 22:22, Eliot Miranda wrote:
>
> >
> >
> > On Wed, May 30, 2012 at 12:59 PM, Stéphane Ducasse <
> stephane.ducasse at inria.fr> wrote:
> > I would like to be sure that we can have
> > - bit for immutable objects
> > - bits for experimenting.
> >
> > There will be quite a few. And one will be able to steal bits from the
> class field if one needs fewer classes. I'm not absolutely sure of the
> layout yet. But for example
> >
> > 8: slot size (255 => extra header word with large size)
> > 3: odd bytes/fixed fields (odd bytes for non-pointer, fixed fields for
> pointer, 7 => # fixed fields is in the class)
> > 4 bits: format (pointers, indexable, bytes/shorts/longs/doubles
> indexable, compiled method, ephemerons, weak, etc)
> > 1: immutability
> > 3: GC 2 mark bits. 1 forwarded bit
> > 20: identity hash
> > 20: class index
> >
> > still leaves 5 bits unused. And stealing 4 bits each from class index
> still leaves 64k classes. So this format is simple and provides lots of
> unused bits. The format field is a great idea as it combines a number of
> orthogonal properties in very few bits. I don't want to include odd bytes
> in format because I think a separate field that holds odd bytes and fixed
> fields is better use of space. But we can gather statistics before
> deciding.
> >
> >
> > Stef
> >
> > On May 30, 2012, at 8:48 AM, Stéphane Ducasse wrote:
> >
> > > Hi guys
> > >
> > > Here is an important topic I would like to see discussed so that we
> see how we can improve and join forces.
> > > May a mail discussion then a wiki for the summary would be good.
> > >
> > >
> > > stef
> > >
> > >
> > >
> > > Begin forwarded message:
> > >
> > >> From: Eliot Miranda <eliot.miranda at gmail.com>
> > >> Subject: Re: Plan/discussion/communication around new object format
> > >> Date: May 27, 2012 10:49:54 PM GMT+02:00
> > >> To: Stéphane Ducasse <stephane.ducasse at inria.fr>
> > >>
> > >>
> > >>
> > >> On Sat, May 26, 2012 at 1:46 AM, Stéphane Ducasse <
> stephane.ducasse at inria.fr> wrote:
> > >> Hi eliot
> > >>
> > >> do you have a description of the new object format you want to
> introduce?
> > >>
> > >> The design is in the class comment of CogMemoryManager in the Cog
> VMMaker packages.
> > >>
> > >> Then what is your schedule?
> > >>
> > >> This is difficult. I have made a small start and should be able to
> spend time on it starting soon. I want to have it finished by early next
> year. But it depends on work schedules etc.
> > >>
> > >>
> > >> I would like to see if we can allocate igor/esteban time before we
> run out of money
> > >> to help on that important topic.
> > >> Now the solution is unclear and I did not see any document where we
> can evaluate
> > >> and plan how we can help. So do you want help on that topic? Then how
> can people
> > >> contribute if any?
> > >>
> > >> The first thing to do is to read the design document, to see if the
> Pharo community thinks it is the right direction, and to review it, spot
> deficiencies etc. So please get those interested to read the class comment
> of CogMemoryManager in the latest VMMaker.oscog.
> > >>
> > >> Here's the current version of it:
> > >>
> > >> CogMemoryManager is currently a place-holder for the design of the
> new Cog VM's object representation and garbage collector. The goals for
> the GC are
> > >>
> > >> - efficient object representation a la Eliot Miranda's VisualWorks
> 64-bit object representation that uses a 64-bit header, eliminating direct
> class references so that all objects refer to their classes indirectly.
> Instead the header contains a constant class index, in a field smaller
> than a full pointer, These class indices are used in inline and first-level
> method caches, hence they do not have to be updated on GC (although they do
> have to be traced to be able to GC classes). Classes are held in a sparse
> weak table. The class table needs only to be indexed by an instance's
> class index in class hierarchy search, in the class primitive, and in
> tracing live objects in the heap. The additional header space is allocated
> to a much expanded identity hash field, reducing hash efficiency problems
> in identity collections due to the extremely small (11 bit) hash field in
> the old Squeak GC. The identity hash field is also a key element of the
> class index scheme. A class's identity hash is its index into the class
> table, so to create an instance of a class one merely copies its identity
> hash into the class index field of the new instance. This implies that
> when classes gain their identity hash they are entered into the class table
> and their identity hash is that of a previously unused index in the table.
> It also implies that there is a maximum number of classes in the table.
> At least for a few years 64k classes should be enough. A class is entered
> into the class table in the following operations:
> > >> behaviorHash
> > >> adoptInstance
> > >> instantiate
> > >> become (i.e. if an old class becomes a new class)
> > >> if target class field's = to original's id hash
> > >> and replacement's id hash is zero
> > >> enter replacement in class table
> > >> behaviorHash is a special version of identityHash that must be
> implemented in the image by any object that can function as a class (i.e.
> Behavior).
> > >>
> > >> - more immediate classes. An immediate Character class would speed
> up String accessing, especially for WideString, since no instatiation needs
> to be done on at:put: and no dereference need be done on at:. In a 32-bit
> system tag checking is complex since it is thought important to retain
> 31-bit SmallIntegers. Hence, as in current Squeak, the least significant
> bit set implies a SmallInteger, but Characters would likely have a tag
> pattern of xxx10. Hence masking with 11 results in two values for
> SmallInteger, xxx01 and xxx11. 30-bit characters are more than adequate
> for Unicode. In a 64-bit system we can use the full three bits and
> usefully implement an immediate Float. As in VisualWorks a functional
> representation takes three bits away from the exponent. Rotating to put
> the sign bit in the least significant non-tag bit makes expanding and
> contracting the 8-bit exponent to the 11-bit IEEE double exponent easy ad
> makes comparing negative and positive zero easier (an immediate Float is
> zero if its unsigned 64-bits are < 16). So the representation looks like
> > >> | 8 bit exponent | 52 bit mantissa | sign bit | 3 tag bits |
> > >> For details see "60-bit immediate Floats" below.
> > >>
> > >>
> > >> - efficient scavenging. The current Squeak GC uses a slow
> pointer-reversal collector that writes every field in live objects three
> times in each collection, twice in the pointer-reversing heap traversal to
> mark live objects and once to update the pointer to its new location. A
> scavenger writes every field of live data twice in each collection, once as
> it does a block copy of the object when copying to to space, once as it
> traverses the live pointers in the to space objects. Of course the block
> copy is a relatively cheap write.
> > >>
> > >> - lazy become. The JIT's use of inline cacheing provides a cheap way
> of avoiding scanning the heap as part of a become (which is the simple
> approach to implementing become in a system with direct pointers). A
> becomeForward: on a (set of) non-zero-sized object(s) turns the object into
> a "corpse" or "forwarding object" whose first (non-header) word/slot is
> replaced by a pointer to the target of the becomeForward:. The corpse's
> class index is set to one that identifies corpses and, because it is a
> hidden class index, will always fail an inline cache test. The inline
> cache failure code is then responsible for following the forwarding pointer
> chain (these are Iliffe vectors :) ) and resolving to the actual target.
> We have yet to determine exactly how this is done (e.g. change the
> receiver register and/or stack contents and retry the send, perhaps
> scanning the current activation). See below on how we deal with becomes on
> objects with named inst vars. Note that we probably don't have to worry
> about zero-sized objects. These are unlikely to be passed through the FFI
> (there is nothing to pass :) ) and so will rarely be becommed. If they do,
> they can become slowly. Alternatively we can insist that objects are at
> least 16 bytes in size (see a8-byte alignment below) so that there will
> always be space for a forwarding pointer. Since none of the immediate
> classes can have non-immediate instances and since we allocate the
> immediate classes indices corresponding to their tag pattern (SmallInteger
> = 1, Character = 3, SmallFloat = 4?) we can use all the class indices from
> 0 to 7 for special uses, 0 = forward, and e.g. 1 = header-sized filler.
> > >>
> > >> - pinning. To support a robust and easy-to-use FFI the memory
> manager must support temporary pinning where individual objects can be
> prevented from being moved by the GC for as long as required, either by
> being one of an in-progress FFI call's arguments, or by having pinning
> asserted by a primitive (allowing objects to be passed to external code
> that retains a reference to the object after returning). Pinning probably
> implies a per-object "is-pinned" bit in the object header. Pinning will be
> done via lazy become; i..e an object in new space will be becommed into a
> pinned object in old space. We will only support pinning in old space.
> > >>
> > >> - efficient old space collection. An incremental collector (a la
> Dijkstra's three colour algorithm) collects old space, e.g. via an amount
> of tracing being hung off scavenges and/or old space allocations at an
> adaptive rate that keeps full garbage collections to a minimum. (see free
> space/free list below)
> > >>
> > >> - 8-byte alignment. It is advantageous for the FFI, for
> floating-point access, for object movement and for 32/64-bit compatibility
> to keep object sizes in units of 8 bytes. For the FFI, 8-byte alignment
> means passing objects to code that expects that requirement (such as modern
> x86 numeric processing instructions). This implies that
> > >> - the starts of all spaces are aligned on 8-byte boundaries
> > >> - object allocation rounds up the requested size to a multiple
> of 8 bytes
> > >> - the overflow size field is also 8 bytes
> > >> We shall probably keep the minimum object size at 16 bytes so that
> there is always room for a forwarding pointer. But this implies that we
> will need to implement an 8-byte filler to fill holes between objects > 16
> bytes whose length mod 16 bytes is 8 bytes and following pinned objects.
> We can do this using a special class index, e.g. 1, so that the method
> that answers the size of an object looks like, e.g.
> > >> chunkSizeOf: oop
> > >> <var: #oop type: #'object *'>
> > >> ^object classIndex = 1
> > >> ifTrue: [BaseHeaderSize]
> > >> ifFalse: [BaseHeaderSize
> > >> + (object slotSize = OverflowSlotSize
> > >> ifTrue:
> [OverflowSizeBytes]
> > >> ifFalse: [0])
> > >> + (object slotSize * BytesPerSlot)]
> > >>
> > >> chunkStartOf: oop
> > >> <var: #oop type: #'object *'>
> > >> ^(self cCoerceSimple: oop to: #'char *')
> > >> - ((object classIndex = 1
> > >> or: [object slotSize ~= OverflowSlotSize])
> > >> ifTrue: [0]
> > >> ifFalse: [OverflowSizeBytes])
> > >>
> > >> For the moment we do not tackle the issue of heap growth and
> shrinkage with the ability to allocate and deallocate heap segments via
> memory-mapping. This technique allows space to be released back to the OS
> by unmapping empty segments. We may revisit this but it is not a key
> requirement for the first implementation.
> > >>
> > >> The basic approach is to use a fixed size new space and a growable
> old space. The new space is a classic three-space nursery a la Ungar's
> Generation Scavenging, a large eden for new objects and two smaller
> survivor spaces that exchange roles on each collection, one being the to
> space to which surviving objects are copied, the other being the from space
> of the survivors of the previous collection, i.e. the previous to space.
> > >>
> > >> To provide apparent pinning in new space we rely on lazy become.
> Since most pinned objects will be byte data and these do not require stack
> zone activation scanning, the overhead is simply an old space allocation
> and corpsing.
> > >>
> > >> To provide pinning in old space, large objects are implicitly pinned
> (because it is expensive to move large objects and, because they are both
> large and relatively rare, they contribute little to overall fragmentation
> - as in aggregates, small objects can be used to fill-in the spaces between
> karge objects). Hence, objects above a particular size are automatically
> allocated in old space, rather than new space. Small objects are pinned as
> per objects in new space, by asserting the pin bit, which will be set
> automaticaly when allocating a large object. As a last resort, or by
> programmer control (the fullGC primitive) old space is collected via
> mark-sweep (mark-compact) and so the mark phase must build the list of
> pinned objects around which the sweep/compact phase must carefully step.
> > >>
> > >> Free space in old space is organized by a free list/free tree as in
> Eliot's VisualWorks 5i old space allocator. There are 64 free lists,
> indices 1 through 63 holding blocks of space of that size, index 0 holding
> a semi-balanced ordered tree of free blocks, each node being the head of
> the list of free blocks of that size. At the start of the mark phase the
> free list is thrown away and the sweep phase coallesces free space and
> steps over pinned objects as it proceeds. We can reuse the forwarding
> pointer compaction scheme used in the old collector. Incremental
> collections merely move unmarked objects to the free lists (as well as
> nilling weak pointers in weak arrays and scheduling them for finalization).
> The occupancy of the free lists is represented by a bitmap in a 64-bit
> integer so that an allocation of size 63 or less can know whether there
> exists a free chunk of that size, but more importantly can know whether a
> free chunk larger than it exists in the fixed size free lists without
> having to search all larger free list heads.
> > >>
> > >> The incremental collector (a la Dijkstra's three colour algorithm)
> collects old space via an amount of tracing being hung off scavenges and/or
> old space allocations at an adaptive rate that keeps full garbage
> collections to a minimum. [N.B. Not sure how to do this yet. The
> incremental collector needs to complete a pass often enough to reclaim
> objects, but infrequent enough not to waste time. So some form of feedback
> should work. In VisualWorks tracing is broken into quanta or work where
> image-level code determines the size of a quantum based on how fast the
> machine is, and how big the heap is. This code could easily live in the
> VM, controllable through vmParameterAt:put:. An alternative would be to
> use the heartbeat to bound quanta by time. But in any case some amount of
> incremental collection would be done on old space allocation and
> scavenging, the ammount being chosen to keep pause times acceptably short,
> and at a rate to reclaim old space before a full GC is required, i.e. at a
> rate proportional to the growth in old space]. The incemental collector is
> a state machine, being either marking, nilling weak pointers, or freeing.
> If nilling weak pointers is not done atomically then there must be a read
> barrier in weak array at: so that reading from an old space weak array that
> is holding stale un-nilled references to unmarked objects. Tricks such as
> including the weak bit in bounds calculations can make this cheap for
> non-weak arrays. Alternatively nilling weak pointers can be made an atomic
> part of incremental collection, which can be made cheaper by maintaining
> the set of weak arrays (e.g. on a list).
> > >>
> > >> The incremental collector implies a more complex write barrier.
> Objects are of three colours, black, having been scanned, grey, being
> scanned, and white, unreached. A mark stack holds the grey objects. If
> the incremental collector is marking and an unmarked white object is stored
> into a black object then the stored object must become grey, being added to
> the mark stack. So the wrte barrier is essentially
> > >> target isYoung ifFalse:
> > >> [newValue isYoung
> > >> ifTrue: [target isInRememberedSet ifFalse:
> > >> [target addToRememberedSet]]
> "target now refers to a young object; it is a root for scavenges"
> > >> ifFalse:
> > >> [(target isBlack
> > >> and: [igc marking
> > >> and: [newValue isWhite]]) ifTrue:
> > >> [newValue beGrey]]] "add
> newValue to IGC's markStack for subsequent scanning"
> > >>
> > >> The incremental collector does not detect already marked objects all
> of whose references have been overwritten by other stores (e.g. in the
> above if newValue overwrites the sole remaining reference to a marked
> object). So the incremental collector only guarantees to collect all
> garbage created in cycle N at the end of cycle N + 1. The cost is hence
> slightly worse memory density but the benefit, provided the IGC works hard
> enough, is the elimination of long pauses due to full garbage collections,
> which become actions of last resort or programmer desire.
> > >>
> > >> Lazy become.
> > >>
> > >> As described earlier the basic idea behind lazy become is to use
> corpses (forwarding objects) that are followed lazily during GC and inline
> cache miss. However, a lazy scheme cannot be used on objects with named
> inst vars without adding checking to all inst var accesses, which we judge
> too expensive. Instead, when becomming objects with named inst vars, we
> scan all activations in the stack zone, eagerly becomming these references,
> and we check for corpses when faulting in a context into the stack zone.
> Essentially, the invariant is that there are no references to corpses from
> the receiver slots of stack activations. A detail is whether we allow or
> forbid pinning of closure indirection vectors, or scan the entire stack of
> each activation. Using a special class index pun for indirection vectors
> is a cheap way of preventing their becomming/pinning etc. Although "don't
> do that" (don't attempt to pin/become indirection vectors) is also an
> acceptable response.
> > >>
> > >> 60-bit immediate Floats
> > >> Representation for immediate doubles, only used in the 64-bit
> implementation. Immediate doubles have the same 52 bit mantissa as IEEE
> double-precision floating-point, but only have 8 bits of exponent. So
> they occupy just less than the middle 1/8th of the double range. They
> overlap the normal single-precision floats which also have 8 bit exponents,
> but exclude the single-precision denormals (exponent-127) and the
> single-precsion NaNs (exponent +127). +/- zero is just a pair of values
> with both exponent and mantissa 0.
> > >> So the non-zero immediate doubles range from
> > >> +/- 0x3800,0000,0000,0001 / 5.8774717541114d-39
> > >> to +/- 0x47ff,ffff,ffff,ffff / 6.8056473384188d+38
> > >> The encoded tagged form has the sign bit moved to the least
> significant bit, which allows for faster encode/decode because offsetting
> the exponent can't overflow into the sign bit and because testing for +/- 0
> is an unsigned compare for <= 0xf:
> > >> msb
> lsb
> > >> [8 exponent subset bits][52 mantissa bits ][1 sign bit][3 tag
> bits]
> > >> So assuming the tag is 5, the tagged non-zero bit patterns are
> > >> 0x0000,0000,0000,001[d/5]
> > >> to 0xffff,ffff,ffff,fff[d/5]
> > >> and +/- 0d is 0x0000,0000,0000,000[d/5]
> > >> Encode/decode of non-zero values in machine code looks like:
> > >> msb
> lsb
> > >> Decode:
> [8expsubset][52mantissa][1s][3tags]
> > >> shift away tags: [ 000
> ][8expsubset][52mantissa][1s]
> > >> add exponent offset: [ 11 exponent ][52mantissa][1s]
> > >> rot sign: [1s][ 11 exponent
> ][52mantissa]
> > >>
> > >> Encode: [1s][ 11 exponent
> ][52mantissa]
> > >> rot sign: [ 11 exponent
> ][52mantissa][1s]
> > >> sub exponent offset: [ 000 ][8expsubset][52 mantissa][1s]
> > >> shift: [8expsubset][52
> mantissa][1s][ 000 ]
> > >> or/add tags: [8expsubset][52mantissa][1s][3tags]
> > >> but is slower in C because
> > >> a) there is no rotate, and
> > >> b) raw conversion between double and quadword must (at least in the
> source) move bits through memory ( quadword = *(q64 *)&doubleVariable).
> > >>
> > >> Issues:
> > >> How do we avoid the Size4Bit for 64-bits? The format word encodes
> the number of odd bytes, but currently has only 4 bits and hence only
> supports odd bytes of 0 - 3. We need odd bytes of 0 - 7. But I don't like
> the separate Size4Bit. Best to change the VI code and have a 5 bit format?
> We loose one bit but save two bits (isEphemeron and isWeak (or three, if
> isPointers)) for a net gain of one (or two)
> > >>
> > >> Further, keep Squeak's format idea or go for separate bits? For
> 64-bits we need a 5 bit format field. This contrasts with isPointers,
> isWeak, isEphemeron, 3 odd size bits (or byte size).. format field is
> quite economical.
> > >>
> > >> Are class indices in inline caches strong references to classes or
> weak references?
> > >> If strong then they must be scanned during GC and the methodZone must
> be flushed on fullGC to reclaim all classes (this looks to be a bug in the
> V3 Cogit).
> > >> If weak then when the class table loses references, PICs containing
> freed classes must be freed and then sends to freed PICs or containing
> freed clases must be unlinked.
> > >> The second approach is faster; the common case is scanning the class
> table, the uncommon case is freeing classes. The second approach is
> better; in-line caches do not prevent reclamation of classes.
> > >>
> > >>
> > >> Stef
> > >>
> > >>
> > >>
> > >> --
> > >> best,
> > >> Eliot
> > >>
> > >
> >
> >
> >
> >
> >
> > --
> > best,
> > Eliot
> >
>
> --
> Stefan Marr
> Software Languages Lab
> Vrije Universiteit Brussel
> Pleinlaan 2 / B-1050 Brussels / Belgium
> http://soft.vub.ac.be/~smarr
> Phone: +32 2 629 2974
> Fax: +32 2 629 3525
>
>
--
best,
Eliot
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.squeakfoundation.org/pipermail/vm-dev/attachments/20120530/02dd3fb8/attachment-0001.htm
More information about the Vm-dev
mailing list