Dear Squeakers,
I have changed my mind a little bit after studying some Squeak sources.
Now I plan to implement LargeInteger primitives (20-39) _without_ changing the representation of LargeIntegers by extending the Interpreter arithmetic primitives using the CCodeGenerator. This avoids some problems dealing with different platforms.
In a later step changing the representation to bigger 'digits' (16,32,.. bits?) _could_ be interesting.
What do you think?
Stephan
P.S.: Is [Q] the right string to get attention from Squeak Central?
-----Original Message----- From: Stephan Rudlof [mailto:stephan.rudlof@ipk.fhg.de] Sent: Dienstag, 26. Oktober 1999 11:24 To: squeak Subject: RE: Better integer performance possible?
Dan, Michael and others,
There are three reasons LargeInts are slower:
It is all written in ST (no primitive support) The inner loops runs byte by byte [FYI, they were written when Squeak had 12-bit smallInts -- can you guess out why?] Squeak does not have a JIT -- yet.
If I'd have the freedom to change the representation of LargeInts I would like to implement this stuff as primitives in C.
We would happily invite a replacement LargeInt package into Squeak if it were at least as simple and understandable (I cannot imagine this not being the case!!), and if it did not add a significant overhead to the VM.
And change some Smalltalk methods to call the new primitives.
Timeline: Because I'm a new squeaker - but familiar with VW Smalltalk - and haven't change implementations of a VM - but implemented user primitives for VW Smalltalk - I'm planning quiet some time to become familiar with it. I could start in the mid of November - then I will have holidays - and I think I could finish around christmas.
What do you think?
Greetings,
Stephan Rudlof
sr (stephan.rudlof@ipk.fhg.de) "Genius doesn't work on an assembly line basis. You can't simply say, 'Today I will be brilliant.'" -- Kirk, "The Ultimate Computer", stardate 4731.3
-----Original Message----- From: Dan Ingalls [mailto:Dan.Ingalls@disney.com] Sent: Montag, 25. Oktober 1999 22:32 To: squeak@cs.uiuc.edu Subject: Re: Better integer performance possible?
Stephan Rudlof stephan.rudlof@ipk.fhg.de wrote...
The integer performance of Squeak is poor compared with other
Smalltalks,
especially for big ints (~ factor 500) and not so much for
small ones (~
factor 3).
What's the reason? Does anybody know if improvements are possible?
There are three reasons LargeInts are slower:
It is all written in ST (no primitive support) The inner loops runs byte by byte [FYI, they were written when Squeak had 12-bit smallInts -- can you guess out why?] Squeak does not have a JIT -- yet.
There is one reason SmallInts are slower:
Squeak does not have a JIT -- yet. Also, in the factorial example, it may matter that Squeak allocates real objects for all contextx
Ask Ian about Jitter -- he has been having fun with it recently. When he's done, I think the SmallInt comparison will be much better. LargeInts will be similarly better, but not like the Dolphin numbers. I personally have not ever found LargeInt speed to be a concern in any useful code.
We would happily invite a replacement LargeInt package into Squeak if it were at least as simple and understandable (I cannot imagine this not being the case!!), and if it did not add a significant overhead to the VM.
- Dan
PS: I know Squeak's bytecode speed got better on Windows in a recent VM, so 2.6 may be somewhat faster than 2.5.
WinNT Dolphin Smalltalk Version 2.1 Patch Level 2:
Time millisecondsToRun: [2000 factorial]. "big ints" 76 75 76 Time millisecondsToRun: [1000 timesRepeat: [12 factorial]].
"small ints"
11 11 10
WinNT Squeak 2.5: Time millisecondsToRun: [2000 factorial]. "big ints" 34520 34520 35080 Time millisecondsToRun: [1000 timesRepeat: [12 factorial]].
"small ints"
30 30 30
Stephan
At 01:22 AM 10/30/99 +0200, you wrote:
Dear Squeakers,
I have changed my mind a little bit after studying some Squeak sources.
Now I plan to implement LargeInteger primitives (20-39) _without_ changing the representation of LargeIntegers by extending the Interpreter arithmetic primitives using the CCodeGenerator. This avoids some problems dealing with different platforms.
In a later step changing the representation to bigger 'digits' (16,32,.. bits?) _could_ be interesting.
What do you think?
Stephan
P.S.: Is [Q] the right string to get attention from Squeak Central?
<grin> with OOPSLA coming up I should think it will take more than a "[Q]" to get Squeak Central's attention ;^)
"No matter where you go, there you are." ......... Buckaroo Banzai
mailto:jbuff@pacific.net
/^X-UID:/,/^Received: / { /^$/d }
Something I've casually looked at is making LargeIntegers just be arrays of SmallIntegers. This way you leverage the SmallInteger primitives and should gain considerable performance -- even with a Smalltalk only implementation (a primitive or two might be desireable to make the shift from SmallIntegers to LargeIntegers and back more graceful). This approach has actually been used in the past in some Lisps (don't know if any still do this). One thing you could do is to simply make negation of a LargeInteger be the negation of each of its SmallInteger components -- which should allow combining LargePositiveInteger and LargeNegativeInteger into a single LargeInteger class, along with a few other nice side effects.
IMO, it just seems like a good first step to try before diving into creating the LargeInteger primitives, and should give good insight into handling larger "digits".
-- Dwight
Stephan Rudlof wrote:
Dear Squeakers,
I have changed my mind a little bit after studying some Squeak sources.
Now I plan to implement LargeInteger primitives (20-39) _without_ changing the representation of LargeIntegers by extending the Interpreter arithmetic primitives using the CCodeGenerator. This avoids some problems dealing with different platforms.
In a later step changing the representation to bigger 'digits' (16,32,.. bits?) _could_ be interesting.
What do you think?
Something I've casually looked at is making LargeIntegers just be arrays of SmallIntegers. This way you leverage the SmallInteger primitives and should gain considerable performance -- even with a Smalltalk only implementation (a primitive or two might be desireable to make the shift from SmallIntegers to LargeIntegers and back more graceful). This approach has actually been used in the past in some Lisps (don't know if any still do this). One thing you could do is to simply make negation of a LargeInteger be the negation of each of its SmallInteger components -- which should allow combining LargePositiveInteger and LargeNegativeInteger into a single LargeInteger class, along with a few other nice side effects.
In theory this is great, but in practice I don't see how it can work. If you are going to use only the low 8 to 15 bits, it might work, but I don't see any real savings in speed deriving from that experiment. If you are going to use between 15 to 30 bits (SmallInteger only has 30 bits for positive numbers), then you will have to deal, directly or indirectly with overflows for your multiplication operations, perhaps by breaking it up into pieces and shifting and stuff (slowing it back down), or using an underlying LargeInteger operation to fill in (same thing).
I think building a plugin instead, particularly because a great deal of the existing code in the image can be appropriated (the preamble code needs fixing, but the rest probably can be made to work) is probably not too difficult, certainly for the innermost loops. A profile shows that for reasonably large numbers 90-98% is presently spent inside the innermost routines.
Thus, a plugin could in theory be built (without too much sweat) that wouldn't slow up the status quo much at all, and would run much faster with a plugin installed.
I believe that at the time Smalltalk-80 blue book was written, there probably wasn't much real need for big number computations, and the thing cranked just fine for the usual baby examples. But crypto may be the "killer need" for good, fast, built-ins. Particularly if we can make this seamlessly pluggable for more speed, perhaps this is a worthwhile project.
But we don't need to wait for CS, certainly for the simple undertaking to pluginize inner code for the private longInt routines for Number, if only to take the profiles to see if that is enough to make a real difference.
Of course, we could just write pluggable primitive code for each primitive directly (since they presently all just fail right now) without meaningful cost (although it would likely require some fiddling with Number and some of the coercion code to make SmallInteger + Large---Integer work fast. But why do all that just now?
I think building a plugin instead, particularly because a great deal of the existing code in the image can be appropriated (the preamble code needs fixing, but the rest probably can be made to work) is probably not too difficult, certainly for the innermost loops. A profile shows that for reasonably large numbers 90-98% is presently spent inside the innermost routines.
Thus, a plugin could in theory be built (without too much sweat) that wouldn't slow up the status quo much at all, and would run much faster with a plugin installed.
Why not to use primitives? There are some free numbers (20-39) for this purpose and if it's faster then everybody will want it. And I don't see any image or sources size problem for these additions.
But we don't need to wait for CS, certainly for the simple undertaking to pluginize inner code for the private longInt routines for Number, if only to take the profiles to see if that is enough to make a real difference.
I wanted to know if anybody just works on this additions to avoid superfluous work and to strenghten my motivation!
Of course, we could just write pluggable primitive code for each primitive directly (since they presently all just fail right now) without meaningful cost (although it would likely require some fiddling with Number and some of the coercion code to make SmallInteger + Large---Integer work fast. But why do all that just now?
My opinion: - it's worthable for many people because doing that improves fundamental data types; - I see a chance for me to be successful in not so long time; - I'm learning much about calling C-routines from Squeak, what can be useful for more specialized purposes, too.
Stephan
I think building a plugin instead, particularly because a great deal of the existing code in the image can be appropriated (the preamble code needs fixing, but the rest probably can be made to work) is probably not too difficult, certainly for the innermost loops. A profile shows that for reasonably large numbers 90-98% is presently spent inside the innermost routines.
Thus, a plugin could in theory be built (without too much sweat) that wouldn't slow up the status quo much at all, and would run much faster with a plugin installed.
Why not to use primitives? There are some free numbers (20-39) for this purpose and if it's faster then everybody will want it. And I don't see any image or sources size problem for these additions.
Absolutely correct concerning image and source space. And primitive number space is not an issue. We are talking VM footprint. Right now, others are pluginizing many existing VM primitives so as to reduce space, not because they are not useful, but because not everyone uses them all the time and VM size seems a concern for some applications.
Is it so clear that everyone will want to make the space-speed tradeoff that a VM change would require? For many long integer performance is satisfactory, and even impressive. (My only uses were to compute certain odds for my daughter relating to her Pokemon card game, lottery odds and the occasional "factorial" demo, and of course, behind the scenes intermediate computations from SmallInts that I hadn't noticed when doing 32-bit operations)
Footprint of the VM on PDA's is precious for some. Doing it as a plugin (almost the same mechanism, by the way) permits those who want it to plugin and those who don't to pass.
But we don't need to wait for CS, certainly for the simple undertaking to pluginize inner code for the private longInt routines for Number, if only to take the profiles to see if that is enough to make a real difference.
I wanted to know if anybody just works on this additions to avoid superfluous work and to strenghten my motivation!
I had actually done this for addition (by far the easiest operation) when testing the new plugin compiler. If I can find it, I'd be happy to send it along.
Of course, we could just write pluggable primitive code for each primitive directly (since they presently all just fail right now) without meaningful cost (although it would likely require some fiddling with Number and some of the coercion code to make SmallInteger + Large---Integer work fast. But why do all that just now?
My opinion:
- it's worthable for many people because doing that improves fundamental
data types;
Speed performance only -- functionality should remain identical. There is a need for more speed only for certain applications (like crypto).
- I see a chance for me to be successful in not so long time;
True of both approaches. (I believe that you will find the plugin route a faster path.)
- I'm learning much about calling C-routines from Squeak, what can be useful
for more specialized purposes, too.
Also true of both approaches.
-----Original Message----- From: Andrew C. Greenberg [mailto:werdna@gate.net] Sent: Samstag, 30. Oktober 1999 15:48 To: squeak@cs.uiuc.edu Subject: RE: [Q] Project: Better performance for LargeIntegers
I think building a plugin instead, particularly because a great deal of the existing code in the image can be appropriated (the preamble code needs fixing, but the rest probably can be made to work) is probably not too difficult, certainly for the innermost loops. A profile shows that for reasonably large numbers 90-98% is presently spent inside the innermost routines.
Thus, a plugin could in theory be built (without too much sweat) that wouldn't slow up the status quo much at all, and would run much faster with a plugin installed.
Why not to use primitives? There are some free numbers (20-39) for this purpose and if it's faster then everybody will want it. And I
don't see any
image or sources size problem for these additions.
Absolutely correct concerning image and source space. And primitive number space is not an issue. We are talking VM footprint.
That's correct.
Right now, others are pluginizing many existing VM primitives so as to reduce space, not because they are not useful, but because not everyone uses them all the time and VM size seems a concern for some applications.
Is it so clear that everyone will want to make the space-speed tradeoff that a VM change would require? For many long integer performance is satisfactory, and even impressive. (My only uses were to compute certain odds for my daughter relating to her Pokemon card game, lottery odds and the occasional "factorial" demo, and of course, behind the scenes intermediate computations from SmallInts that I hadn't noticed when doing 32-bit operations)
Footprint of the VM on PDA's is precious for some. Doing it as a plugin (almost the same mechanism, by the way) permits those who want it to plugin and those who don't to pass.
I agree with the goal that not everything should be implemented as a primitive. But for LargeInteger arithmetic there are some primitive numbers free for just that case (though probably for smaller LargeInts in previous Smalltalks) and we are dealing with a very fundamental data type here. In addition we would have a - very moderate - bigger VM, but at the same time a _smaller_ image! Some methods are moved to the primitives then. So I don't see the advantage of parallelizing the computations in methods and Plugins.
But we don't need to wait for CS, certainly for the simple undertaking to pluginize inner code for the private longInt routines for Number, if only to take the profiles to see if that is enough to make a real difference.
I wanted to know if anybody just works on this additions to avoid superfluous work and to strenghten my motivation!
I had actually done this for addition (by far the easiest operation) when testing the new plugin compiler. If I can find it, I'd be happy to send it along.
Would be very interesting for me, but first I have to install Squeak onto my new Laptop at home with a fresh Linux ;-)
Of course, we could just write pluggable primitive code for each primitive directly (since they presently all just fail right now) without meaningful cost (although it would likely require some fiddling with Number and some of the coercion code to make SmallInteger + Large---Integer work fast. But why do all that just now?
My opinion:
- it's worthable for many people because doing that improves fundamental
data types;
Speed performance only -- functionality should remain identical. There is a need for more speed only for certain applications (like crypto).
--> But speed also doesn't hurt!
- I see a chance for me to be successful in not so long time;
True of both approaches. (I believe that you will find the plugin route a faster path.)
I don't know, but I thank you for the hint. At first it didn't seem to be easier to understand for me than the primitive approach, but I did ignore the compile times (see other mails)...
- I'm learning much about calling C-routines from Squeak, what
can be useful
for more specialized purposes, too.
Also true of both approaches.
On Sat, Oct 30, 1999 at 11:29:20AM +0200, Stephan Rudlof wrote:
Thus, a plugin could in theory be built (without too much sweat) that wouldn't slow up the status quo much at all, and would run much faster with a plugin installed.
Why not to use primitives? There are some free numbers (20-39) for this purpose and if it's faster then everybody will want it. And I don't see any image or sources size problem for these additions.
Actually, it turns out that it's easier to write pluggable primitives than to write the normal kind, once you get the hang of it. In my first amateurish attempts to write some primitives (for what later became the OSProcess plugin), I found that I was dealing with about a twenty minute compile time on my Linux Pentium 100 computer. This included generating the C code and building the new VM, and most of this time was taken in generating and building interp.c.
Once I figured out how to write the primitives in a pluggable module, I found that the whole thing can be done under a minute. This is because you only need to generate and build your plugin module, and the rest of the VM is untouched.
Now, don't get me wrong, there are real advantages to a twenty minute compilation cycle. Combined with a good cup of coffee, it really focuses the mind. None of this business of "let's just try this one little change and see what happens," you really need to plan ahead. But for testing a few new primitives, try doing it in a plugin and you'll save yourself a lot of time.
Dear Squeakers,
I have changed my mind a little bit after studying some Squeak sources.
Now I plan to implement LargeInteger primitives (20-39) _without_ changing the representation of LargeIntegers by extending the Interpreter arithmetic primitives using the CCodeGenerator. This avoids some problems dealing with different platforms.
In a later step changing the representation to bigger 'digits' (16,32,.. bits?) _could_ be interesting.
Perhaps 16 bits, but unlikely 32. How, for example, would you handle integer overflow when doing, say, the LargeInteger add? And how could you do the multiply without bit munging? Is it possible that a straight loop with carry would be much faster than all the masking and shifting necessary to do a carry written in C?
Dear Squeakers,
I have changed my mind a little bit after studying some Squeak sources.
Now I plan to implement LargeInteger primitives (20-39)
_without_ changing
the representation of LargeIntegers by extending the Interpreter
arithmetic
primitives using the CCodeGenerator. This avoids some problems dealing with different platforms.
In a later step changing the representation to bigger 'digits' (16,32,.. bits?) _could_ be interesting.
Perhaps 16 bits, but unlikely 32. How, for example, would you handle integer overflow when doing, say, the LargeInteger add?
My first idea was to take one small and one big C-integer type, represent digits with the small and perform the operations with the bigger one. If the bigger one is twice as big as the smaller there are no problems with mult. I think on all machines with ANSI-C these conditions are met; at _least_ we could take an 8 bit val for representation and make the computation in 16 bit, at best - so far - representation in 32 and computation in 64. But here we introduce platform specific stuff, which has to be handled. By using the Smalltalk to C compiler without changing the current representation we are _totally_ platform independent (no #defines in C at all) and in spite of this hopefully much faster than without compiling in C.
And how could you do the multiply without bit munging? Is it possible that a straight loop with carry would be much faster than all the masking and shifting necessary to do a carry written in C?
That could be, but then you'd introduce assembler, what is _very_ platform dependent.
Because I've never compiled a Squeak so far I think it's best for me first to take the straigthforward approach and then to think of other representations...
Stephan
Dear Squeakers,
I have changed my mind a little bit after studying some Squeak sources.
Now I plan to implement LargeInteger primitives (20-39)
_without_ changing
the representation of LargeIntegers by extending the Interpreter
arithmetic
primitives using the CCodeGenerator. This avoids some problems dealing with different platforms.
In a later step changing the representation to bigger 'digits' (16,32,.. bits?) _could_ be interesting.
Perhaps 16 bits, but unlikely 32. How, for example, would you handle integer overflow when doing, say, the LargeInteger add?
My first idea was to take one small and one big C-integer type, represent digits with the small and perform the operations with the bigger one. If the bigger one is twice as big as the smaller there are no problems with mult.
I agree that would work for 16-bit digits, but you need stronger stuff for 32. A downside is that this would not work in any of the simulated Smalltalk, since SmallIntegers are not full 32-bit values. The resulting Smalltalk code would only be useful for compilation.
I think on all machines with ANSI-C these conditions are met; at _least_ we could take an 8 bit val for representation and make the computation in 16 bit, at best - so far - representation in 32 and computation in 64.
Indeed, that's exactly what is done right now. bytes are accumulated by addition and multiplication into SmallIntegers.
But here we introduce platform specific stuff, which has to be handled. By using the Smalltalk to C compiler without changing the current representation we are _totally_ platform independent (no #defines in C at all) and in spite of this hopefully much faster than without compiling in C.
I don't understand. Of course we would use the Smalltalk to C compiler, but won't will still have to compile the C code? It does raise a byte endian gender issue, that inside the C-code is irrelevant, even though different platforms will execute differently, but for the remaining smalltalk code than manipulates digits of the object will need to address. A simple solution is to implement a digitAt: function as well, so that your plugin or core code will work fine across platforms, bigendian or otherwise.
I see no reason why any implementation need be platform specific. Done properly, the generated code should work fine with any conforming C compiler supporting 32-bit ints. (Note that Squeak does expect the C compiler to treat ints as 32-bitters.)
And how could you do the multiply without bit munging? Is it possible that a straight loop with carry would be much faster than all the masking and shifting necessary to do a carry written in C?
That could be, but then you'd introduce assembler, what is _very_ platform dependent.
Because I've never compiled a Squeak so far I think it's best for me first to take the straigthforward approach and then to think of other representations...
Are you planning to rewrite as numbered primitives for incorporation in the VM? Why not make a pluggable primitive instead, so that folks who don't need speedier longs (most users) won't need to have the code in their VM footprint? Done right, the code should run identically, albeit slower, without the plugin installed as it does when the plugin is installed.
-----Original Message----- From: Andrew C. Greenberg [mailto:werdna@gate.net] Sent: Samstag, 30. Oktober 1999 14:54 To: squeak@cs.uiuc.edu Subject: RE: [Q] Project: Better performance for LargeIntegers
Dear Squeakers,
I have changed my mind a little bit after studying some
Squeak sources.
Now I plan to implement LargeInteger primitives (20-39)
_without_ changing
the representation of LargeIntegers by extending the Interpreter
arithmetic
primitives using the CCodeGenerator. This avoids some problems dealing with different platforms.
In a later step changing the representation to bigger
'digits' (16,32,..
bits?) _could_ be interesting.
Perhaps 16 bits, but unlikely 32. How, for example, would you handle integer overflow when doing, say, the LargeInteger add?
My first idea was to take one small and one big C-integer type, represent digits with the small and perform the operations with the bigger
one. If the
bigger one is twice as big as the smaller there are no problems
with mult.
I agree that would work for 16-bit digits, but you need stronger stuff for 32. A downside is that this would not work in any of the simulated Smalltalk, since SmallIntegers are not full 32-bit values. The resulting Smalltalk code would only be useful for compilation.
I agree.
I've forgotten to say that at the time of 'my first idea' I didn't thought of using the CCodeGenerator... I wanted to do the whole stuff in C.
I think on all machines with ANSI-C these conditions are met; at
_least_ we
could take an 8 bit val for representation and make the computation in 16 bit, at best - so far - representation in 32 and computation in 64.
Indeed, that's exactly what is done right now. bytes are accumulated by addition and multiplication into SmallIntegers.
But here we introduce platform specific stuff, which has to be
handled. By
using the Smalltalk to C compiler without changing the current representation we are _totally_ platform independent (no #defines in C at all) and in spite of this hopefully much faster than without
compiling in C.
I don't understand. Of course we would use the Smalltalk to C compiler, but won't will still have to compile the C code? It does raise a byte endian gender issue, that inside the C-code is irrelevant, even though different platforms will execute differently, but for the remaining smalltalk code than manipulates digits of the object will need to address. A simple solution is to implement a digitAt: function as well, so that your plugin or core code will work fine across platforms, bigendian or otherwise.
'Without changing the current representation' and building the C-code in analogue to the existing Smalltalk implementation I cannot see any byte endian problem, this approach includes just exactly 'A simple solution is to implement a digitAt: function as well'.
I see no reason why any implementation need be platform specific. Done properly, the generated code should work fine with any conforming C compiler supporting 32-bit ints. (Note that Squeak does expect the C compiler to treat ints as 32-bitters.)
I've thought over the problem that there are the ANSI-C assertions (formally lazy written) sizeof(char) == 1 < 2 <= sizeof(short) <= sizeof(int) <= sizeof(long) && 2 <= sizeof(short) <= sizeof(int) && 4 <= sizeof(long) , but it's possible to have sizeof(short) == sizeof(int) == sizeof(long) for example. OK, I don't know such a computer... But what's needed is just a smaller data type for representation as for computing to totally do this in C.
Yet another problem: Pi * thumb I think it could be a problem to transfer the internal representation of a very big number by big - say 16 Bit - 'digits' in a printable 10 based one. I guess that there is needed some kind of string computation, isn't it?
And how could you do the multiply without bit munging? Is it possible that a straight loop with carry would be much faster than all the masking and shifting necessary to do a carry written in C?
That could be, but then you'd introduce assembler, what is
_very_ platform
dependent.
Because I've never compiled a Squeak so far I think it's best
for me first
to take the straigthforward approach and then to think of other representations...
Are you planning to rewrite as numbered primitives for incorporation in the VM? Why not make a pluggable primitive instead, so that folks who don't need speedier longs (most users) won't need to have the code in their VM footprint? Done right, the code should run identically, albeit slower, without the plugin installed as it does when the plugin is installed.
Later about plugins versus primitives...
squeak-dev@lists.squeakfoundation.org