CPU running smalltalk bytecode

Scott A Crosby crosby at qwes.math.cmu.edu
Mon Feb 11 09:38:11 UTC 2002


On Sun, 10 Feb 2002, Tim Rowledge wrote:

> My take on things is that a possible and practical change in hardware
> that would benefit us (and many programs) would be an instruction cache
> that was precisely controllable by the programmer. A 2-4Mb i-cache that
> one could actually load the core vm into and _lock_ it in would be nice.

Would it? AFAIK, the bigger the cache, the slower it is (in terms of
bandwidth and/or latency) — which is why we have layered caches: L1, L2,
L3, RAM, HD. Also, programmers are notoriously bad at predicting where
the problems actually are. I'd rather profile and focus my efforts where
they do the most good.

Better yet, I'd much rather have a dynamic system that learns what is
most important and puts that into the cache than do it manually. For
something the size of Squeak, it'll be both more accurate and faster.

And the gains can be significant. Look at what I've done over the last few
months... And I *know* there's another 10-15% more in this VM, beyond my
method cache, root-table overflow, and BC work. [*]

> An improvement on that might be to go back to the writable control store
> idiom, putting the vm 'above the bus'. A controllable d-cache might be
> useful in letting us make sure that recent contexts and important
> globals stay cached, stuff like that.

With an LRU cache discipline, this is almost assured. If it's used a lot,
it'll be in the cache (barring conflict misses). And the CPU will figure
out when it is and *isn't* important, fully automatically. I'd rather
trust its dynamic judgement than the faulty, imprecise predictions I
might attempt to make.

> However as I've said again and again (redundantly even), it's bandwidth,
> bandwidth and bandwidth.

Is it? That seems to be more an implementation concern than an inherent
property of Smalltalk. Right now, for the existing Squeak, it may very
well be true, but is that an artifact of Smalltalk, of our implementation
of Smalltalk, or of all major software systems on modern hardware?

>
> You can increase 'real' bandwidth by making the machine faster - memory
> bus mainly. That is the 'purest' approach in a sense. I'm gut-level sure
> that a really simple 600MHz cpu with 600MHz memory would outperform a
> multi-GHz cpu with 133MHz memory and caches and burstmode and and and,
> plus be much simpler (no caches to worry about, maybe no registers
> even). Sadly we can't buy such memory at Fry's. Yet :-) Watch for MRAM.
> Why can't we have an ARMcore and 128Mbytes on a single chip!
>
> You can increase 'apparent' bandwidth with caches, burst read memory,
> writeback buffers, bigger register sets, whatever. This makes the
> machine appear to be faster much of the time but introduces all sorts of
> uncertainties and complications - a cache miss can be very expensive if
> you're unlucky or incompetent. I suppose we could throw in parallel
> processing here, though it could also go above.
>

It's not really bandwidth so much as latency... Latency subsumes
bandwidth, but covers more than just that. Sure, you can get
gigabytes/second of bandwidth by flinging a CD across the room, but the
latency stinks. The purpose of a cache is to decrease the *average*
latency of a memory transaction. This assumes the hit rate is high enough
that the decreased latency isn't overshadowed by the added latency and
complexity of the cache itself.

> You can increase 'needed' bandwidth with software trickery to make
> better use of what you have. For us, a dynamic translator or possibly
> full native compilation would serve well. Instead of popping and pushing
> (sounds like drug dealing...) we optimize to storing things in registers
> most of the time, caching decoded oops (and making sure to cope with a
> gc!), squidging converse actions together, all that good stuff.

Yeah... I'd love to write a JIT for a real RISC machine with oodles of
registers. :) But not all machines have 64 registers (lucky bastards!).
On the x86, basically everything is pushing & popping from RAM, but at
least you save the per-bytecode interpretation overhead, and might get a
little simple optimization out of things.

Scott

[*] Without going to JIT or assembly or altering the interpreter core.
Just straight Slang. There's the patch to not check the timer unless
we're doing a long primitive, ~5%. There's the dispatch overhead of
decoding the compact classes for method dispatch, ~2%. There's also the
header-size decoding in GC, ~2%. And even my large method cache looks
like it could profitably be bigger.






More information about the Squeak-dev mailing list