[squeak-dev] 30 bit unboxed floats

Jecel Assumpcao Jr. jecel at merlintec.com
Thu Oct 21 10:08:37 UTC 2010


Eliot,

> [Float in Squeak and VW]
> Does that answer your question?

Yes, I was trying to figure out whether you were more interested in the
mantissa or the exponent. My suggestion would eat up two bits of the
mantissa, but have the same exponent as 32 bit floats.
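
To make that concrete, here is a rough C sketch of the kind of encoding I
have in mind (the tag value and the helper names are placeholders, nothing
that actually exists yet): keep the IEEE 754 single precision sign and
exponent and reuse the two low mantissa bits as the immediate tag.

  /* Sketch of a 30 bit immediate float: same sign and exponent as a 32 bit
   * float, two mantissa bits sacrificed to hold the tag (value assumed). */
  #include <stdint.h>
  #include <string.h>
  #include <stdio.h>

  #define FLOAT_TAG 0x3u                    /* assumed 2 bit immediate tag */

  uint32_t encodeImmediateFloat(float f)
  {
      uint32_t bits;
      memcpy(&bits, &f, sizeof bits);       /* IEEE 754 single bit pattern */
      return (bits & ~0x3u) | FLOAT_TAG;    /* low 2 mantissa bits -> tag  */
  }

  float decodeImmediateFloat(uint32_t word)
  {
      uint32_t bits = word & ~0x3u;         /* the lost bits come back as 0 */
      float f;
      memcpy(&f, &bits, sizeof f);
      return f;
  }

  int main(void)
  {
      float pi = 3.14159274f;
      float back = decodeImmediateFloat(encodeImmediateFloat(pi));
      printf("%.9g -> %.9g\n", pi, back);   /* only last 2 mantissa bits differ */
      return 0;
  }

The round trip only loses the two least significant mantissa bits, which is
the whole precision cost of the idea.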

> [we can have 64-bit FloatArrays]

Sure, and since FloatArrays are used in much more explicit ways than
loose numbers, adding a DoubleArray class would only affect new code.

> In the immortal words of the country bumpkin asked how to get to
> some city: "I wouldn't start from here".
> Hardware can accelerate boxing and unboxing of floats just as Cog
> can.  Cog has special machine code allocation for Floats and of course
> Float is a compact class, which helps in testing for float-ness.

Ah, I'll have to look at the special allocation code. It was the #new
on each math operation that was worrying me. Touching main memory in a
complex expression that would otherwise mostly deal with the stack
seems like a waste.

> But IMO the current Squeak image format is too complex and too slow
> and we'd do better coming up with a better object representation and
> GC representation, which implies focussing on kernel/microsqueak/tracer
> efforts to produce a kernel image and VM work on implementing a new GC.

I fully agree with that. Back when processors were slow, having the
in-memory and on-disk formats be as similar as possible was a big win.
These days it is faster to load a compressed file, for example, than
the original. For Neo Smalltalk I had a format that could be loaded or
saved by either a 16 bit or a 36 bit implementation. All integers were
infinite precision immediate values, for example, so a given number
might become a SmallInteger on one machine but a boxed
LargePositiveInteger on another. On disk it was always the same.
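
Just to illustrate the point, here is a toy C model (the names and limits
are my own, not the real Neo Smalltalk loader) of how the same on-disk
number ends up with different in-memory representations:

  #include <stdio.h>

  /* Bits available for immediate integers on the loading machine
   * (assumption: word size minus a 2 bit tag). */
  static int immediateBits(int wordSize) { return wordSize - 2; }

  static const char *representationFor(int valueBits, int wordSize)
  {
      return valueBits <= immediateBits(wordSize)
          ? "SmallInteger (immediate)"
          : "LargePositiveInteger (boxed)";
  }

  int main(void)
  {
      int valueBits = 20;   /* the same on-disk number, about a million */
      printf("16 bit image: %s\n", representationFor(valueBits, 16));
      printf("36 bit image: %s\n", representationFor(valueBits, 36));
      return 0;
  }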

> Jecel, how flexible is your design methodology?  If the bytecode set
> or the object representation were to change how much work would
> be involved, just some redeclaration or a manual rewrite or
> somewhere in between?

Here is a rather long answer to this simple question:

We have two levels of flexibility: 1) the machines we are currently
building use programmable chips (FPGAs), so anything can be replaced.
We could swap our processor for a Sparc (like the open source Leon 3)
even at runtime, for example. 2) there is some flexibility in the
architecture of SiliconSqueak itself, so even a future version in a
custom chip (ASIC) would still allow some options, as I will describe
below.

Note: a lot of people wonder why, if FPGAs are so cool and flexible,
we don't stick with them forever, and why Intel or AMD don't use them
for their own processors. The flexibility comes at a cost - a $70 FPGA
like the one we are using can implement the equivalent of a $14 ARM
chip, but running at under 100MHz instead of at 1GHz. That is 5 times
the price at a tenth of the speed, or a factor of 50 in
price/performance.

Back to SiliconSqueak - it can handle both bytecodes and its own native
32 bit instructions. The instruction cache can handle a mix of these,
but there is a special 4KB area of this cache that is pinned down for
the 32 bit instructions. When a bytecode is fetched, its value is used
to do a 256 way call into that area (16 bytes, or 4 instructions, per
bytecode). By replacing the content of that 4KB region, you can execute
an entirely different set of bytecodes (for Java or Python, for
example).
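
In software terms the dispatch is nothing more than an address computation
plus a table swap; this C sketch (my approximation, not the actual
hardware) shows the idea:

  #include <stdint.h>

  #define SLOT_BYTES  16u                   /* 4 x 32 bit instructions per bytecode */
  #define TABLE_BYTES (256u * SLOT_BYTES)   /* the pinned 4KB region                */

  /* The address the hardware jumps to for a given bytecode value. */
  uint32_t dispatchAddress(uint32_t tableBase, uint8_t bytecode)
  {
      return tableBase + (uint32_t)bytecode * SLOT_BYTES;
  }

  /* Installing a different 4KB table is all it takes to execute a
   * different bytecode set (Java or Python instead of Squeak). */
  void installBytecodeSet(uint8_t *pinnedRegion, const uint8_t *newTable)
  {
      for (uint32_t i = 0; i < TABLE_BYTES; i++)
          pinnedRegion[i] = newTable[i];
  }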

In addition to this table, there is a hardware "translator" which can be
optionally activated. This is hardwired to a single bytecode set. For
each bytecode, it generates a 32 bit instruction to be executed. For
simple bytecodes, this single instruction is all that is needed and the
next bytecode is fetched. For more complex bytecodes (send and return,
essentially), this is the first instruction which executes in parallel
with the jump into the 4KB table from where the second instruction will
be fetched. So a bytecode set that doesn't use the special hardware
takes two clock cycles to execute the most common bytecodes, while the
one the translator knows takes only one clock cycle for most
bytecodes.
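
A back-of-the-envelope model in C (the numbers are my own assumptions, not
measurements) of what the translator buys:

  #include <stdio.h>

  /* simpleFraction: fraction of executed bytecodes the translator can
   * finish in its single generated instruction; the rest (sends and
   * returns) go through the 4KB table in both modes and are counted
   * here as two cycles, which understates their real cost. */
  static double cyclesPerBytecode(double simpleFraction, int translatorOn)
  {
      double simpleCost  = translatorOn ? 1.0 : 2.0; /* table jump adds a cycle */
      double complexCost = 2.0;                      /* lower bound either way  */
      return simpleFraction * simpleCost + (1.0 - simpleFraction) * complexCost;
  }

  int main(void)
  {
      double f = 0.9;   /* assume 90% of executed bytecodes are "simple" */
      printf("with translator   : %.2f cycles/bytecode\n", cyclesPerBytecode(f, 1));
      printf("table dispatch only: %.2f cycles/bytecode\n", cyclesPerBytecode(f, 0));
      return 0;
  }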

About the object format, the Mushroom trick of a virtual cache is
used. That means that an instruction that wants to push instance
variable 23 of object 16r32416780 will present this pair of 32 bit
numbers to the cache and get a result in the next clock in the case of
a hit. In the case of a miss, some software must update the cache. It
might suppose that 16r32416780 is the byte address of the object and
fetch the few words around that address + 4*23. It might suppose that
this number is a virtual address and look into a tree of tables for
the corresponding physical address. Or it might suppose that this
number is a "handle" (in the old Mac style and some Java
implementations) and do an indirection to find the physical address.

So different object representations are possible with different cache
miss handlers. But note that this part is still being developed - for
now I am using simple hardware to reload the cache, so the first
option is currently hardwired. I'll fix this as soon as possible, but
it requires getting some tricky thread switching right.
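
Here is a C sketch (my own structures, not the actual Mushroom or
SiliconSqueak hardware) of what the cache lookup and the simplest miss
handler look like; a handler for virtual addresses or handles would just
replace the miss function without touching the rest:

  #include <stdint.h>

  typedef uint32_t oop;                     /* 32 bit object reference     */

  typedef struct {
      oop      object;                      /* e.g. 16r32416780            */
      uint32_t index;                       /* e.g. instance variable 23   */
      uint32_t value;
      int      valid;
  } CacheLine;

  #define CACHE_LINES 1024
  static CacheLine cache[CACHE_LINES];

  static uint32_t lineFor(oop object, uint32_t index)
  {
      return (object ^ (index * 2654435761u)) % CACHE_LINES;
  }

  /* Miss handler for the simplest representation: the oop is the byte
   * address of the object, so the value lives at oop + 4*index. Other
   * handlers could walk a tree of tables (virtual addresses) or follow
   * an indirection word (handles) instead. */
  static uint32_t missDirectPointer(oop object, uint32_t index)
  {
      return *(uint32_t *)(uintptr_t)(object + 4u * index);
  }

  uint32_t fetchInstVar(oop object, uint32_t index)
  {
      CacheLine *line = &cache[lineFor(object, index)];
      if (line->valid && line->object == object && line->index == index)
          return line->value;               /* hit: result in the next clock */
      line->object = object;                /* miss: software refill         */
      line->index  = index;
      line->value  = missDirectPointer(object, index);
      line->valid  = 1;
      return line->value;
  }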

In addition to this, there is some hardware support for very specific
Squeak stuff which might complicate implementing very different
bytecodes a bit. And there is already one difference relative to Cog -
the data stack and the return stack are separate in order to keep
temporaries and arguments together. That doesn't mean you can't just
ignore this hardware and use a single stack, however.

-- Jecel



