The scheme is currently:
...001 -> immediate int
...010 -> immediate char
...100 -> immediate double
But with 64 bits, that leaves 62 bits for an immediate char, which makes many chars, well, far too many compared to Unicode requirements... What about swapping the patterns for char and double? I know it complicates the 32 <-> 64 image translation a little, but it gains one more bit for the immediate double exponent that is otherwise unused for char. There is also room for more immediate values, for example a single-precision float. Is there any interest?
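For readers less familiar with tagged pointers, here is a minimal sketch in C of how such a low-bit tag scheme encodes and decodes immediates. All names are hypothetical; the real VM code differs:

```c
#include <stdint.h>

/* Sketch of a 3-bit tag scheme like the one Nicolas describes:
   the low 3 bits of a 64-bit oop select the immediate class and
   the payload lives in the remaining 61 bits. Names are mine. */

typedef uint64_t oop;

enum { TAG_INT = 1 /* ...001 */, TAG_CHAR = 2 /* ...010 */, TAG_DOUBLE = 4 /* ...100 */ };

static oop mk_int(int64_t v)    { return ((uint64_t)v << 3) | TAG_INT; }
static oop mk_char(uint32_t c)  { return ((uint64_t)c << 3) | TAG_CHAR; }

static int      tag_of(oop o)     { return (int)(o & 7); }
static int64_t  int_value(oop o)  { return (int64_t)o >> 3; }  /* arithmetic shift keeps the sign */
static uint32_t char_value(oop o) { return (uint32_t)(o >> 3); }
```

This is only meant to make the bit layout concrete; it ignores overflow checks and assumes the compiler's signed right shift is arithmetic, as it is on the platforms Cog targets.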
On Sun, Nov 10, 2013 at 1:16 PM, Nicolas Cellier <nicolas.cellier.aka.nice@gmail.com> wrote:
The scheme is currently: ...001 -> immediate int, ...010 -> immediate char, ...100 -> immediate double. [...] Is there any interest?
I think an extra bit for immediate double is a good idea. I'm not sure a single precision float immediate is useful given immediate doubles; am I missing something?
On 10 Nov 2013, at 10:16 , Nicolas Cellier nicolas.cellier.aka.nice@gmail.com wrote:
The scheme is currently: ...001 -> immediate int, ...010 -> immediate char, ...100 -> immediate double. [...] Is there any interest?
Sure I'm interested, I suggested approximately the same once: http://forum.world.st/Fwd-Plan-discussion-communication-around-new-object-fo...
Sadly, interest ~= time and: [skills] :/
Cheers, Henry
On 12 Nov 2013, at 11:55 , Henrik Johansen henrik.s.johansen@veloxit.no wrote:
interest ~= time and: [skills]
OT: I swear, missing parenthesis will be the death of me!
And let me remind you what I proposed back then :)

About the immediates zoo.
Keep in mind that the more immediates we have, the more complex the implementation tends to be.
I would just keep 2 data types:
- integers
- floats
and a third, special 'arbitrary' immediate, which the VM sees as a 60-bit value. The interpretation of this value depends on a lookup in a range table, where the developer specifies the correspondence between a value interval and a class: [min .. max] -> class
Intervals, of course, cannot overlap. Determining the class of such an immediate might be slower - O(log2(n)) at best (where n is the size of the range table) - but on the other hand, how many different kinds of immediates can you fit into a 60-bit value? Right, 2^60. Much more than the proposed 8, isn't it? :)
And this extra cost can be mitigated completely by the inline cache:
- in the case of a regular reference, you must fetch the object's class and then compare it with the one stored in the cache;
- in the case of an immediate reference, you compare the immediate value with the min and max stored in the cache fields, and if the value is in range you have a cache hit and are free to proceed.
So it is just one extra comparison compared to the 'classic' inline cache.
And, after thinking about how the inline cache is organized, you can now scratch my first paragraph about immediates! We really don't need to discriminate between small integers/floats/the rest!! They too could be nothing more than ranges defined in our zoo of 'special' immediates!
So, in the end we will have just two kinds of references:
- zero bit == 0 -- an object pointer
- zero bit == 1 -- an immediate
Voila!
We can have a real zoo of immediates, with a simple implementation to support them. Not to mention that the range table is provided by the language side, so we are free to rearrange it at any moment.
And of course, it doesn't mean that the VM could not reserve some of the ranges for its own 'contracted' immediates, like Characters, or even class references for example. Think about it :)
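A sketch of how the range-table lookup and the two-comparison cache test could look in C. The table contents, struct names, and class indices here are made up purely for illustration:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical range table: immediate values are classified by
   non-overlapping [min .. max] intervals, sorted by min. */
typedef struct { uint64_t min, max; int class_index; } Range;

static const Range ranges[] = {
    {       0,     0xFFFF, 1 },   /* say, Character    */
    { 0x10000, 0xFFFFFFFF, 2 },   /* say, SmallInteger */
};

/* Full class lookup: binary search over the table, O(log n). */
static int class_of_immediate(uint64_t value) {
    size_t lo = 0, hi = sizeof ranges / sizeof ranges[0];
    while (lo < hi) {
        size_t mid = (lo + hi) / 2;
        if      (value < ranges[mid].min) hi = mid;
        else if (value > ranges[mid].max) lo = mid + 1;
        else return ranges[mid].class_index;
    }
    return -1;   /* no range matches */
}

/* The inline-cache hit test is just the two comparisons Igor
   describes, against the min/max stored in the cache fields. */
static int cache_hit(uint64_t value, const Range *cached) {
    return value >= cached->min && value <= cached->max;
}
```

The point of the split is that the O(log n) search only runs on a cache miss; the hot path is the two-comparison `cache_hit`.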
I'm not sure I totally understand Igor's post; since I was in the process of responding to Eliot, I need more time :)
Concerning single-precision Floats: they can all be represented losslessly in such an immediate double (except NaN and Inf), so it might be better to keep the VM as simple as possible, as Igor strongly suggests, and let the experiments happen on the image side.
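The lossless-conversion claim is easy to check outside any VM: widening a finite binary32 value to binary64 is exact per IEEE 754, so narrowing it back recovers the original value. A tiny C check:

```c
#include <float.h>

/* Every finite single-precision value converts to double and back
   without loss, which is why an immediate-double format can carry
   all finite floats. (NaN would fail the == test below, so we only
   claim finite values.) */
int float_survives_roundtrip(float f) {
    double d = (double)f;   /* widening is exact for every finite float */
    return (float)d == f;   /* narrowing back recovers it */
}
```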
Of course, I'm pretty convinced that Character MUST be in the immediate zoo; it's an obvious optimization, given that so many algorithms suffer from human interaction and must deal with text at one moment or another :)
2013/11/12 Igor Stasenko siguctua@gmail.com
And let me remind you what i proposed back then: [...]
Hi Igor,
On Tue, Nov 12, 2013 at 3:37 AM, Igor Stasenko siguctua@gmail.com wrote:
And let me remind you what i proposed back then: [...] and third, a special 'arbitrary' immediate, seen by the VM as a 60-bit value, whose interpretation depends on a lookup in a range table where the developer specifies the correspondence between value interval and class: [min .. max] -> class
For this idea to go anywhere you'd have to show at least the pseudo-code for the inline cache test in machine-code methods. These schemes seem great in theory, but in practice they end up with a long, complex and slow fetchClassOf: and/or inline cache test. To remind you, you have to compete with the following in Spur Cog:
Limm:
        andl $0x1, %eax
        j Lcmp
Lentry:
        movl %edx, %eax
        andl $0x3, %eax
        jnz Limm
        movl 0(%edx), %eax
        andl $0x3fffff, %eax
Lcmp:
        cmpl %ecx, %eax
        jnz Lmiss
This occurs at the start of every method. Its efficiency and size are really important for the performance and compactness of machine code. I'm really not trying to be negative; it is simply a fact of life with a Smalltalk JIT.
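Eliot's entry check is easier to follow as C. A rough transliteration (my own, with a simplified header layout; the masks follow the assembly, but the field names are not the VM's actual declarations):

```c
#include <stdint.h>

typedef uint64_t oop;

/* C reading of the Spur Cog entry check: tagged oops reduce to their
   tag pattern, pointer oops fetch the class index from the first
   header word, and either result is compared with the class cached
   at the send site. Header layout is simplified for illustration. */
static int inline_cache_hits(oop receiver, uint64_t cached_class) {
    uint64_t class_or_tag;
    if (receiver & 0x3)                                  /* jnz Limm            */
        class_or_tag = receiver & 0x1;                   /* andl $0x1, %eax     */
    else                                                 /* pointer: read header */
        class_or_tag = *(const uint64_t *)receiver & 0x3fffff; /* class index  */
    return class_or_tag == cached_class;                 /* cmpl %ecx, %eax     */
}
```

Note how cheap the pointer case is: one load and one mask before the compare. That is the baseline any range-based scheme has to beat.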
On 12 November 2013 19:23, Eliot Miranda eliot.miranda@gmail.com wrote:
For this idea to go anywhere you'd have to show at least the pseudo-code for the inline cache test in machine-code methods. These schemes seem great in theory, but in practice they end up with a long, complex and slow fetchClassOf: and/or inline cache test. To remind you, you have to compete with the following in Spur Cog:
Limm:
        andl $0x1, %eax
        j Lcmp
Lentry:
        movl %edx, %eax
        andl $0x3, %eax
        jnz Limm
        movl 0(%edx), %eax
        andl $0x3fffff, %eax
Lcmp:
It is an extra comparison for the immediate case:
        cmp eax, LowRange
        jl Lmiss
        cmp eax, HighRange
        jg Lmiss
So, if the value is between the low and high range, it is already known what class it is; otherwise you get a cache miss.
And this scheme works even better if you pick the highest bit for tagging; that way you can get rid of testing the tag bit completely:

        andl $0x1, %eax
        j Lcmp

and start straight with comparing against the high and low values. That is, of course, if we can use the highest bit for tagging.
The class lookup then will be a binary search in the table of ranges, which is O(log n).
On Thu, Nov 14, 2013 at 3:43 AM, Igor Stasenko siguctua@gmail.com wrote:
It is an extra comparison for the immediate case:
        cmp eax, LowRange
        jl Lmiss
        cmp eax, HighRange
        jg Lmiss
I don't understand. Does that mean there are two cache entries, one for HighRange and one for LowRange? Spell it out, please. Give the pseudo-code. Right now it is too vague for me to see how it is supposed to work.
The class lookup then will be a binary search in the table of ranges, which is O(log n).
So the inline cache becomes a range of entries? That will cause significant code bloat.
-- best, Eliot
On 14 November 2013 17:44, Eliot Miranda eliot.miranda@gmail.com wrote:
I don't understand. Does that mean there are two cache entries, one for HighRange and one for LowRange? Spell it out, please. Give the pseudo-code. Right now it is too vague for me to see how it is supposed to work.
Two entries where? If I understand correctly, in Cog you use generated code for the inline cache, jumping to a PIC entry to check whether it hits or misses. So you just generate different code for the immediate case(s), checking that the receiver oop value is tagged and within a certain range (cache hit) or not (cache miss).
A pseudo-code for bit-pattern matching is this:
"presuming least significant bit is tag" (oop bitAnd: 2r101) == 2r101 ifTrue: [ hit ] ifFalse: [ miss ]
And code for range matching is this:
"presuming least significant bit is tag" (oop odd and: [ oop between: lowValue and: highValue]) ifTrue: [ hit ] ifFalse: [ miss ]
"or presuming most significant bit is tag" (oop between: lowValue and: highValue) ifTrue: [ hit ] ifFalse: [ miss ]
So the inline cache becomes a range of entries? That will cause significant code bloat.
I'm not sure I understood - what extra inline cache are you talking about? Presuming that each PIC is generated code, you just generate a different entry, specialized to detect that a given oop is of a certain immediate class.
But if it's not, then yes, you need to hold these two values somewhere to compare them with the input oop. Even in that case, I fail to see how a single extra comparison could cause so much bloat.
And it is always good to compare against how much bloat the bit-pattern matching code causes.
What is clearly beneficial about immediate ranges is that you can redefine them at any moment by introducing new kinds of immediates, without even needing to touch the VM: the language can easily control that. And the potential number of immediate classes is much bigger if you encode them in ranges, because:
2^31 (smallint) + 2^32 (short float) + 2^24 (character unicode) = 6459228160
which leaves us:
2^63 - 6459228160 = 9223372030395547648 of space for other kinds of immediates, which we can introduce later !!without!! ever needing to bother the VM again.
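Igor's arithmetic checks out; a quick verification with 64-bit integers (the helper names are mine):

```c
#include <stdint.h>

/* Space consumed by the three 'contracted' ranges Igor lists,
   assuming one tag bit leaves 63 payload bits. */
static uint64_t immediate_space_used(void) {
    return (1ULL << 31)    /* smallint range           */
         + (1ULL << 32)    /* short float range        */
         + (1ULL << 24);   /* unicode character range  */
}

static uint64_t immediate_space_left(void) {
    return (1ULL << 63) - immediate_space_used();
}
```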
Hi Igor,
On Fri, Nov 15, 2013 at 5:04 AM, Igor Stasenko siguctua@gmail.com wrote:
Two entries where? If I understand correctly, in Cog you use generated code for the inline cache, jumping to a PIC entry to check whether it hits or misses. So you just generate different code for the immediate case(s), checking that the receiver oop value is tagged and within a certain range (cache hit) or not (cache miss).
That's not right. PICs are only used at send sites that prove to be polymorphic, i.e. they become PICs after a send failure. At first they are simply monomorphic inline caches with a single entry. Only 9% of active send sites fail and get rebound to PICs.
A pseudo-code for bit-pattern matching is this:
"presuming least significant bit is tag" (oop bitAnd: 2r101) == 2r101 ifTrue: [ hit ] ifFalse: [ miss ]
And code for range matching is this:
"presuming least significant bit is tag" (oop odd and: [ oop between: lowValue and: highValue]) ifTrue: [ hit ] ifFalse: [ miss ]
"or presuming most significant bit is tag" (oop between: lowValue and: highValue) ifTrue: [ hit ] ifFalse: [ miss ]
And how are these two pieces of code related?
On 15 November 2013 18:26, Eliot Miranda eliot.miranda@gmail.com wrote:
That's not right. PICs are only used at send sites that prove to be polymorphic, i.e. they become PICs after a send failure. At first they are simply monomorphic inline caches with a single entry. Only 9% of active send sites fail and get rebound to PICs.
Oh right, I was speaking about the monomorphic cache, yeah. We discussed this with Clement today, and while it is good to have a virtually unlimited number of immediate classes, in reality we can barely find 10 cases worth turning into immediate objects. Maybe one day I will try to run the experiment, but for Spur I think it is not worth doing. I remember why I offered this model: I felt a bit of loss when people were discussing how many bits to reserve for the immediates tag (3 or 4), and then how many immediate types you get, and then it is like a lottery to pick the lucky ones which become immediates (while the unlucky poor ones are left behind ;)
Hey,
By the way, since you are talking about immediate object tags and Spur's object representation: I have read in reports on other VMs that overall performance is better when pointers are tagged xxx1 and SmallIntegers are tagged xxx0, because SmallInteger arithmetic is then much faster. Is it possible to modify that easily in Cog to try? Do you only have to edit a few methods such as #longAt:, #integerValueOf: and the SmallInteger primitives?
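The reason a zero tag speeds up SmallInteger arithmetic: with integers encoded as n << 1 (low bit 0), tagged addition is a single machine add, while a low tag bit of 1 needs a correction on one operand. A sketch (the tag layout here is hypothetical, not Cog's actual one):

```c
#include <stdint.h>

typedef intptr_t oop;

/* tag 0 scheme: n encoded as (n << 1). The tags add to 0,
   so tagged addition is just one add instruction. */
static oop add_tag0(oop a, oop b) { return a + b; }

/* tag 1 scheme: n encoded as (n << 1) | 1. The two tag bits
   would add to 2, so one operand must be corrected first. */
static oop add_tag1(oop a, oop b) { return a + (b - 1); }
```

On x86 the tag-1 correction can often be folded into a single lea, so the real gap depends on the processor and the JIT; that is exactly why Clément's "is it easy to try in Cog" question is the right one.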
Another thing: I read in a Java VM report that only 1% of objects ever need a hash (other objects' hashes are unused), so their hash implementation took only 2 bits in the header:
01 -> no hash
10 -> hash is the object's address
11 -> hash is in a header extension
On first access to the hash, the hash bits switch from 01 to 10, and when the GC moves an object whose bits are 10 it switches them from 10 to 11 and adds a header extension holding the old address of the object. But I know I'm not teaching you anything there. What I would like to know is why you chose in Spur to have a 22-bit hash in the header over this implementation, and will the hash in Spur be calculated at each instantiation or lazily?
Thanks for any answers,
Regards, Clement
2013/11/12 Henrik Johansen henrik.s.johansen@veloxit.no
On 12 Nov 2013, at 11:55 , Henrik Johansen henrik.s.johansen@veloxit.no wrote:
Interest ~= time and: [skills]
OT: I swear, missing parenthesis will be the death of me!
Hi Clément,
On Tue, Nov 12, 2013 at 3:56 AM, Clément Bera bera.clement@gmail.comwrote:
Hey,
By the way, as you are talking about immediate object tags and Spur's object representation, I read in reports on other VMs that it is faster in overall performance to have pointers tagged with xxx1 and SmallIntegers tagged with xxx0, because SmallInteger arithmetic is then much faster.
I'm sure that was true 20 years ago, when register operations were about the same speed as memory operations. But now that memory operations are at least an order of magnitude slower, I suspect that adjusting the tags (all register operations) is in the noise. But I agree it *would* be interesting to measure.
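The trade-off can be modelled with a toy tagging sketch (illustrative Python only, not Cog code; the 3-bit tag width mirrors the ...xxx patterns discussed above):

```python
TAG_BITS = 3

# Scheme A: SmallIntegers tagged ...000.  A plain machine add of two
# tagged values already yields the correctly tagged sum.
def tag0(n):   return n << TAG_BITS
def untag0(t): return t >> TAG_BITS

assert untag0(tag0(3) + tag0(4)) == 7        # one add, no tag fix-up

# Scheme B: SmallIntegers tagged ...001.  A raw add doubles the tag
# (low bits become ...010), so one extra register op restores it.
def tag1(n):   return (n << TAG_BITS) | 1
def untag1(t): return t >> TAG_BITS

assert untag1(tag1(3) + tag1(4) - 1) == 7    # add, then subtract a tag
```

With xxx0 tags the fix-up disappears entirely, which is the speed argument; Eliot's point is that this saved register operation is cheap next to memory traffic.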
Is it possible to modify that easily in Cog to try? Do you have to edit only a few methods, such as #longAt:, #integerValueOf: and the SmallInteger primitives?
You'd have to modify the isImmediate:/isNonImmediate:, isIntegerObject:, isNonIntegerObject:, and storePointer:ofObject:withValue:/fetchPointer:ofObject: methods, and then the code generated by the CogObjectRepresentation subclass, and modify the bootstrap to produce an image with the new tagging. The VM is factored so that this isn't horribly difficult. But it's work. Perhaps it would be a good student project?
Another thing: I read in a Java VM report that only 1% of objects need a hash (other objects' hashes are unused), so their hash implementation took only 2 bits in the header: 01 -> no hash, 10 -> hash is the object's address, 11 -> hash is in a header extension. Therefore, on first access to the hash, the hash bits switch from 01 to 10, and the GC, on objects with 10, switches the hash bits from 10 to 11 and adds a header extension with the old address of the object. But I know I'm not teaching you anything there.
That's clever. Yes you are :-). Thanks.
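For reference, the 2-bit scheme Clément describes can be modelled in a few lines (hypothetical Python, not code from any VM; `gc_move` stands in for a compacting collector):

```python
NO_HASH, HASH_IS_ADDRESS, HASH_IN_EXTENSION = 0b01, 0b10, 0b11

class Obj:
    def __init__(self, address):
        self.address = address
        self.hash_bits = NO_HASH
        self.stored_hash = None      # the "header extension"

def identity_hash(obj):
    # First access flips 01 -> 10: the current address becomes the hash.
    if obj.hash_bits == NO_HASH:
        obj.hash_bits = HASH_IS_ADDRESS
    return obj.address if obj.hash_bits == HASH_IS_ADDRESS else obj.stored_hash

def gc_move(obj, new_address):
    # Moving an object in state 10 flips it to 11, preserving the old
    # address in the header extension so the hash stays stable.
    if obj.hash_bits == HASH_IS_ADDRESS:
        obj.stored_hash = obj.address
        obj.hash_bits = HASH_IN_EXTENSION
    obj.address = new_address

o = Obj(address=0x1000)
assert identity_hash(o) == 0x1000
gc_move(o, 0x2000)
assert identity_hash(o) == 0x1000    # hash survives the move
```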
What I would like to know is why you chose in Spur to have a 22-bit hash in the header over this implementation, and whether the hash in Spur will be calculated at each instantiation or lazily.
I chose the 22-bit hash scheme because it is part of the idea of having class indices. Notice that an object's class index is 22 bits also. The point here is that a class's hash is its class index. So to instantiate a class one copies the class's hash into the instance's class index, avoiding searching the class table to find the class's index. Perhaps this isn't the optimal approach, but it is coherent, and it worked well for 64-bit VisualWorks.
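A rough sketch of that coupling (hypothetical Python; the names and the reserved index 0 are assumptions of this sketch, not Spur's actual layout):

```python
CLASS_INDEX_BITS = 22
class_table = [None]        # assume index 0 is reserved

class Behavior:
    def __init__(self):
        self.hash = 0       # 0 means "not yet in the class table"

def ensure_behavior_hash(behavior):
    # A class's identity hash IS its index in the class table.
    if behavior.hash == 0:
        if len(class_table) >= 1 << CLASS_INDEX_BITS:
            raise MemoryError('class table full')
        class_table.append(behavior)
        behavior.hash = len(class_table) - 1
    return behavior.hash

def instantiate(behavior):
    # Copy the class's hash straight into the new instance's
    # class-index field: no table search at instantiation time.
    return {'class_index': ensure_behavior_hash(behavior)}

point_class = Behavior()
p = instantiate(point_class)
assert class_table[p['class_index']] is point_class
```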
The hash calculation is lazy. The hash field is 0 when instantiated. Hence
Behavior>>identityHash
	"Answer a SmallInteger whose value is related to the receiver's identity.
	 This method must not be overridden. Primitive. Fails if the receiver is
	 not a Behavior. Essential. See Object documentation whatIsAPrimitive.
	 Do not override."
	<primitive: 175>
	self primitiveFailed

InterpreterPrimitives>>primitiveBehaviorHash
	| hashOrError |
	self assert: ((objectMemory isNonImmediate: self stackTop)
				and: [self addressCouldBeClassObj: self stackTop]).
	hashOrError := objectMemory ensureBehaviorHash: self stackTop.
	hashOrError >= 0
		ifTrue: [self pop: argumentCount + 1 thenPushInteger: hashOrError]
		ifFalse: [self primitiveFailFor: hashOrError negated]

SpurMemoryManager>>ensureBehaviorHash: aBehavior
	| newHash err |
	<inline: true>
	self assert: (coInterpreter addressCouldBeClassObj: aBehavior).
	(newHash := self rawHashBitsOf: aBehavior) = 0 ifTrue:
		[(err := self enterIntoClassTable: aBehavior) ~= 0 ifTrue:
			[^err negated].
		 newHash := self rawHashBitsOf: aBehavior.
		 self assert: (self classAtIndex: newHash) = aBehavior].
	^newHash
Thanks for any answers,
Regards, Clement
cheers!
Thanks for the answer.
Clement I love your questions :) Stef
Eliot Miranda wrote:
On Tue, Nov 12, 2013 at 3:56 AM, Clément Bera wrote:
By the way, as you are talking about immediate object tags and Spur's object representation, I read in reports on other VMs that it is faster in overall performance to have pointers tagged with xxx1 and SmallIntegers tagged with xxx0, because SmallInteger arithmetic is then much faster.
I'm sure that was true 20 years ago, when register operations were about the same speed as memory operations. But now that memory operations are at least an order of magnitude slower, I suspect that adjusting the tags (all register operations) is in the noise. But I agree it *would* be interesting to measure.
Another issue is that the 680x0 had addressing modes that aren't common on current processors. Having SmallIntegers be xxx0 means you don't have to untag them to add or subtract and then retag them; shifts and multiplies require only a single scaling factor beyond the basic operation. But the xxx1 tag for oops introduces complications: if you are going to access a fixed instance variable, for example instance variable 3, then you can deal with it at assembly time:
MOVE.L (-1+4*(3+1),A5),D1 ; where A5 is the OOP
But if you are going to access a field in an Array then you have to do it at runtime:
MOVE.L (-1+4*FixedFields,A5,D2*4),D1 ; where A5 is the OOP and D2 is the index
While it takes several instructions to do the same thing on a current processor, the 680x0 took quite a few clock cycles to execute the single instruction above, and it took several words in memory to encode it, so the gain was mostly imaginary.
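The asymmetry Jecel points out can be restated as plain address arithmetic (illustrative Python; 4-byte fields as on the 680x0, numbers chosen only for the example):

```python
WORD = 4
TAG = 1                      # oops tagged ...1

def fixed_field_offset(i):
    # Fixed inst var i: the -1 tag correction folds into a constant
    # known at assembly time, as in MOVE.L (-1+4*(3+1),A5),D1.
    return -TAG + WORD * (i + 1)

def indexed_field_address(oop, fixed_fields, index):
    # Array element: the index is only known at run time, so the
    # correction rides along in the address computation, as in
    # MOVE.L (-1+4*FixedFields,A5,D2*4),D1.
    return oop - TAG + WORD * fixed_fields + index * WORD

assert fixed_field_offset(3) == 15           # matches -1+4*(3+1) above
assert indexed_field_address(0x1000, 2, 5) == 0x1000 + 27
```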
-- Jecel
vm-dev@lists.squeakfoundation.org