Hi Jecel,
On Tue, 09 Oct 2007 03:40:28 +0200, you wrote: [ :many | many, interesting things ]
> ... With aggressive inlining, however, highly optimized methods will need to access more registers.
I've thought about this for [how many?] years now, time and again. It may be possible to introduce inlineable blocks that don't need much more register space. Access to the outer scope could be done similarly to what the B5xxx did with her display registers. It is well known that that machine ran perfectly well with just 4 display registers (only her very large sisters had additional, expensive display registers). The concept at work here is D[ll], with ll = the current scope level (in our case, the inlined material's data). D[ll] points to D[ll-1], and so on (in our case, to the code it was inlined into), from which data registers can be loaded. What do you think?
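To make the display idea concrete, here is a minimal Python sketch of lexical addressing through a display. It is a toy model, not the B5xxx hardware: Frame, build_display and load are hypothetical names, each scope is assumed to keep a link to its enclosing scope, and the display is built from those links once on entry, so any outer variable is then reached with a single indexed access instead of a walk along the chain.

```python
class Frame:
    """Hypothetical activation record (or inlined scope) with its local data."""

    def __init__(self, name, data, outer=None):
        self.name = name
        self.data = data    # local variables of this scope
        self.outer = outer  # lexically enclosing scope (static link)


def build_display(frame):
    """Return the display for `frame`: D[0] is the outermost scope and
    D[ll] is `frame` itself, where ll is its lexical level."""
    chain = []
    while frame is not None:
        chain.append(frame)
        frame = frame.outer
    chain.reverse()
    return chain


def load(display, level, index):
    """Load variable `index` of the scope at lexical `level` in one step."""
    return display[level].data[index]
```

For example, a block nested two levels inside a method gets a three-entry display, and the method's second variable is load(display, 0, 1). A machine with only 4 display registers simply limits how deep such nesting can go.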
> Since the prefix instructions were only defined for immediate operands, their use with non-immediate instructions has been defined to extend the destination and source fields. This allows up to 256 extra local registers to be used.
Yeah, that's an alternative.
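The prefix trick can be illustrated with a small decoding sketch. The thread does not give the actual encoding, so the field widths here are assumptions (BASE_BITS and PREFIX_BITS are hypothetical), chosen so that a prefixed register field spans 256 register numbers; the prefix payload simply becomes the high bits of the field.

```python
# Hypothetical widths: the real instruction encoding is not given in the thread.
BASE_BITS = 3    # register field width in an unprefixed instruction
PREFIX_BITS = 5  # extra high bits a prefix contributes to that field


def extend_field(field, prefix=None):
    """Combine a prefix payload (high bits) with a register field (low bits)."""
    assert 0 <= field < (1 << BASE_BITS)
    if prefix is None:
        return field
    assert 0 <= prefix < (1 << PREFIX_BITS)
    return (prefix << BASE_BITS) | field


def decode(dest_field, src_field, dest_prefix=None, src_prefix=None):
    """Return (dest, src) register numbers after applying any prefixes."""
    return extend_field(dest_field, dest_prefix), extend_field(src_field, src_prefix)
```

With these widths an unprefixed instruction still reaches registers 0 to 7, while a prefixed one can name any register from 0 to 255.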
> The prolog and epilog code for such methods will not be trivial, as it will have to deal with allocating a contiguous chunk of physical registers among the various threaded lists of frames. But this shouldn't be a problem: such methods will execute for quite a while (or they wouldn't have been compiled with so much inlining in the first place), which makes this overhead worth it.
> And this is what I have today and am starting to implement. It is a bit more complex than I would like, but I feel the extra features are important for it to do well at interpreting bytecodes, running highly factored native code and running deeply inlined native code.
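The prolog's job of finding a contiguous chunk of physical registers can be sketched as a first-fit allocator over free runs. This is only an assumption about how it might work: RegisterFile, allocate and release are hypothetical names, and the real design threads frames through lists in hardware-specific ways not described here.

```python
class RegisterFile:
    """Hypothetical first-fit allocator for contiguous runs of physical
    registers. Free space is a sorted list of (start, length) runs."""

    def __init__(self, size):
        self.free = [(0, size)]

    def allocate(self, count):
        """Return the first register of a free run of `count`, or None."""
        for i, (start, length) in enumerate(self.free):
            if length >= count:
                if length == count:
                    del self.free[i]
                else:
                    self.free[i] = (start + count, length - count)
                return start
        return None  # no contiguous run: frames would have to be spilled

    def release(self, start, count):
        """Return a run to the free list, merging adjacent runs."""
        runs = sorted(self.free + [(start, count)])
        merged = []
        for s, n in runs:
            if merged and merged[-1][0] + merged[-1][1] == s:
                ps, pn = merged[-1]
                merged[-1] = (ps, pn + n)
            else:
                merged.append((s, n))
        self.free = merged
```

Merging on release keeps large runs available, which matters here because deeply inlined methods ask for big chunks while ordinary frames ask for small ones.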
Good luck!
Cheers,
Klaus
> -- Jecel
_______________________________________________
Hardware mailing list
Hardware@lists.squeakfoundation.org
http://lists.squeakfoundation.org/mailman/listinfo/hardware
Klaus D. Witzel wrote:
>> ... With aggressive inlining, however, highly optimized methods will need to access more registers.
> I've thought about this for [how many?] years now, time and again. It may be possible to introduce inlineable blocks that don't need much more register space. Access to the outer scope could be done similarly to what the B5xxx did with her display registers. It is well known that that machine ran perfectly well with just 4 display registers (only her very large sisters had additional, expensive display registers). The concept at work here is D[ll], with ll = the current scope level (in our case, the inlined material's data). D[ll] points to D[ll-1], and so on (in our case, to the code it was inlined into), from which data registers can be loaded. What do you think?
I suppose that by now you won't be exactly shocked if I say that I have already done that in one of my projects? The 16-bit processor named "Oliver" started out as a small variation on Chuck Moore's MISC Forth processor. The instruction set was essentially the same (a simple stack machine with five-bit opcodes), but it had an extra register named "SELF". All addresses, including the PC, were relative to this register, and besides the normal call instruction there was a version that would pop the top of the data stack into SELF. Calls saved the previous value of SELF along with the PC on the return stack, and returns naturally restored both registers.
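The SELF mechanism described above can be modeled in a few lines. This toy Python model covers only the SELF-relative addressing and the two call variants; the class and method names are hypothetical, and opcodes, word sizes and how the PC advances past a call are left out (the saved PC is assumed to already be the return point).

```python
class Oliver:
    """Toy model of SELF handling as described for "Oliver" (hypothetical names)."""

    def __init__(self, memory):
        self.memory = memory     # flat list of words
        self.data_stack = []
        self.return_stack = []
        self.self_reg = 0        # the SELF register
        self.pc = 0              # offset relative to SELF

    def address(self, offset):
        # Every address, including the PC, is relative to SELF.
        return self.self_reg + offset

    def fetch(self, offset):
        return self.memory[self.address(offset)]

    def call(self, target):
        # Ordinary call: save PC and SELF; SELF stays the same.
        self.return_stack.append((self.pc, self.self_reg))
        self.pc = target

    def call_with_self(self, target):
        # Variant: the new SELF is popped from the data stack.
        self.return_stack.append((self.pc, self.self_reg))
        self.self_reg = self.data_stack.pop()
        self.pc = target

    def ret(self):
        # Returning restores both PC and SELF.
        self.pc, self.self_reg = self.return_stack.pop()
```

Because code and data are both addressed relative to SELF, pushing an object's base address and using the SELF-loading call is effectively a message send to that object.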
The details are not important, but in 2002 my clients observed that Forth seemed a bit hard to learn, while the Smalltalk I was using for my main project could be learned by children. So they asked if I couldn't do their project in Smalltalk too. Initially I said it wasn't possible: the FPGA I was using for their machine had only 15 thousand gates, to keep the product really cheap (under $20), while I planned to use a 300 thousand gate FPGA for the children's computer (in contrast, my current design uses an FPGA with 1.5 million gates!).
But after thinking about this for a month or so, I saw that the only thing missing for the OO Forth processor to run Smalltalk was some kind of support for blocks. So I added two groups of 8 registers each, which could be efficiently saved to and reloaded from memory. The first few registers in a group would be used exactly as display registers. So for a block three levels deep loaded into group A, you would have the local variables in A2 to A7, while A0 pointed to the group with the next lexical level and A1 to the group in the level beyond that (normally the home method). The group B registers are a scratch pad for non-local variables, so to access something in an outer lexical scope you load it into B (one instruction - it will first save the previous contents if these were dirty) if it is not already there (the compiler keeps track of this) and just use it. I know this is more explicit than the display registers in a Burroughs machine, but it is a good fit for a MISC.
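The A/B group scheme can be sketched as follows. This is a hypothetical model: memory maps a group's address to its 8 saved registers, A0 and A1 hold the addresses of the two enclosing groups, and read_outer checks at run time whether B already holds the wanted group, a check the real compiler makes statically.

```python
GROUP_SIZE = 8


class BlockRegisters:
    """Hypothetical model of register groups A (current scope) and B (scratch pad)."""

    def __init__(self, memory, a_addr):
        self.memory = memory              # group address -> list of 8 registers
        self.a = list(memory[a_addr])     # A0, A1 are links; A2..A7 are locals
        self.b = [0] * GROUP_SIZE
        self.b_addr = None                # which saved group B currently mirrors
        self.b_dirty = False

    def load_b(self, group_addr):
        """Bring an outer scope's group into B, saving the old one if dirty."""
        if self.b_dirty:
            self.memory[self.b_addr] = list(self.b)
        self.b = list(self.memory[group_addr])
        self.b_addr = group_addr
        self.b_dirty = False

    def read_outer(self, levels_out, index):
        """Read register `index` of the scope 1 (via A0) or 2 (via A1) levels out."""
        link = self.a[levels_out - 1]
        if self.b_addr != link:           # done statically by the compiler
            self.load_b(link)
        return self.b[index]

    def write_b(self, index, value):
        self.b[index] = value
        self.b_dirty = True
```

In the usage below, a write to the home method's variable stays in B until another group is loaded, at which point it is written back.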
For the stack processor in Plurion I could address as many registers as I liked, using the prefix instructions to increase the operand field of any other instruction. So the above scheme was replaced by mapping higher lexical levels at 64-register intervals. Registers 0 to 63 (normally only 0 to 7 actually exist - you need to explicitly allocate more in groups of 8 if you want them) are the local variables, 64 to 127 (64 to 71, most likely) are the variables in the next lexical level, 128 to 191 the second level, and so on.
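The 64-register interval mapping reduces to a division and a remainder. The sketch below assumes each level allocates whole groups of 8 starting at slot 0; split_register, register_number and is_allocated are hypothetical helper names.

```python
LEVEL_STRIDE = 64  # register numbers reserved per lexical level
GROUP = 8          # registers are allocated in groups of 8


def split_register(r):
    """Map an extended register number to (lexical_level, slot)."""
    return divmod(r, LEVEL_STRIDE)


def register_number(level, slot):
    """Inverse of split_register."""
    assert 0 <= slot < LEVEL_STRIDE
    return level * LEVEL_STRIDE + slot


def is_allocated(r, groups_by_level):
    """True if register r exists, given how many 8-register groups
    each level has allocated (hypothetical bookkeeping)."""
    level, slot = split_register(r)
    return slot < groups_by_level[level] * GROUP
```

With one group per level, only registers 0 to 7, 64 to 71, 128 to 135 and so on exist; a scope with more locals allocates further groups within its own 64-register window.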
-- Jecel