Re: multithreading support - Squeak-dev

29 Jan 1998


      Hi.
...
> I am starting prototyping effort of add native multi-threading support
> to Squeak.  Any suggestions, hints, advice would be appreciated.
Just a few days ago I started thinking about the same thing. For me, the 
main reason is that I have a two-processor computer. And if I had 
access to an n-processor machine and could run Squeak on it, I would 
like to take advantage of parallel processing.
First of all, I looked at making the virtual machine itself able to run on 
multiple processors at once. I thought of the virtual machine as just another 
kind of cpu like the ones I already knew, such as the i80386 and its 
successors, the 68xxx and so on. So I 
concluded that I could use the same tricks the guys at Intel and 
Motorola came up with to improve performance on their processors, but now on 
the virtual machine. I thought it would be nice to implement the following 
things:
vm1) Split the virtual machine into synchronized core parts. The first 
two parts I thought of were fetch and execute. If one process were 
dedicated to looking ahead for byte codes while another one actually executed them, 
things would go faster (up to twice as fast, at best). This is roughly what the Intel guys 
did: the 80486 already overlapped fetching and executing in a pipeline, and the 
Pentiums added branch prediction (the branch target buffer, I think they call it). 
They guess what a jump instruction will 
do, and tell the prefetch queue to fill itself as if the jump instruction had 
been executed in the predicted way. Here the picture becomes clearer, because 
usually the time consuming instructions are message sends. So just 
looking up the methods that could be executed before the virtual machine 
asks for them should improve performance.
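As a rough illustration of the look-ahead idea in vm1, here is a minimal sketch in Python. Everything in it is an assumption made for the example: the toy class table, the MethodCache class and the run function are hypothetical and have nothing to do with Squeak's actual interpreter. A fetcher thread walks the upcoming sends and fills a lookup cache, so that the executor only sees cache hits. The fetcher is joined before execution starts only to keep the demo deterministic; in a real VM the two would overlap, and Python threads would not show the speedup anyway.

```python
import threading

# Hypothetical toy "image": class name -> (superclass name, method dictionary).
CLASSES = {
    "Object": (None, {"printString": "Object>>printString"}),
    "Number": ("Object", {"+": "Number>>+"}),
    "Integer": ("Number", {"factorial": "Integer>>factorial"}),
}


def lookup(class_name, selector):
    """Walk the superclass chain; the slow path a message send would take."""
    while class_name is not None:
        superclass, methods = CLASSES[class_name]
        if selector in methods:
            return methods[selector]
        class_name = superclass
    return None  # stands in for doesNotUnderstand:


class MethodCache:
    """(class, selector) -> method cache shared by fetcher and executor."""

    def __init__(self):
        self._lock = threading.Lock()
        self._entries = {}
        self.hits = 0
        self.misses = 0

    def warm(self, class_name, selector):
        """Called by the fetcher: do the lookup ahead of time."""
        method = lookup(class_name, selector)
        with self._lock:
            self._entries[(class_name, selector)] = method

    def send(self, class_name, selector):
        """Called by the executor: use the cache, fall back to a full lookup."""
        key = (class_name, selector)
        with self._lock:
            if key in self._entries:
                self.hits += 1
                return self._entries[key]
            self.misses += 1
        method = lookup(class_name, selector)
        with self._lock:
            self._entries[key] = method
        return method


def run(sends, prefetch=True):
    """Execute a list of (class, selector) sends, optionally prefetching them."""
    cache = MethodCache()
    if prefetch:
        def fetch_ahead():
            for class_name, selector in sends:
                cache.warm(class_name, selector)

        fetcher = threading.Thread(target=fetch_ahead)
        fetcher.start()
        fetcher.join()  # joined early only to make the demo deterministic
    results = [cache.send(c, s) for c, s in sends]
    return results, cache.hits, cache.misses
```

Calling run with a list such as [("Integer", "factorial"), ("Integer", "+")] returns the methods found plus the hit and miss counts, which makes it easy to compare the prefetching and non-prefetching cases.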
vm2) Optimize the instruction set. This idea is still rather 
underdeveloped. The fetcher could also translate the byte codes into some other 
representation that allows faster execution (like actual native code, for 
instance). I don't know much about byte codes, but from what I've seen in compiled 
methods, pushes, pops and returns tend to occur in almost every method. We 
could gain extra performance if we made the virtual machine superscalar as well. 
To allow for that, we could look at whether consecutive 
byte codes are independent and, if they are, execute them both at the same time through 
instruction pipelines (real machine threads). This is done on all Pentiums and 
on a lot of other chips. It would be great to find a way of deciding whether two 
consecutive message sends are independent of each other or not.
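The independence test in vm2 can be sketched as a check on what each operation reads and writes. The Python below is only an illustration under strong assumptions: the Op record, the op helper and the pair_up function are hypothetical, and the operations are already in a register-like form with known read and write sets (plain stack byte codes almost always depend on each other through the stack). For a general message send the read and write sets are not known in advance, which is exactly the hard part.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """A hypothetical translated operation with explicit read/write sets."""
    name: str
    reads: frozenset
    writes: frozenset


def op(name, reads=(), writes=()):
    return Op(name, frozenset(reads), frozenset(writes))


def independent(first, second):
    """True if the two operations may run at the same time."""
    read_after_write = first.writes & second.reads
    write_after_read = first.reads & second.writes
    write_after_write = first.writes & second.writes
    return not (read_after_write or write_after_read or write_after_write)


def pair_up(ops):
    """Greedily group consecutive independent operations two at a time."""
    groups = []
    i = 0
    while i < len(ops):
        if i + 1 < len(ops) and independent(ops[i], ops[i + 1]):
            groups.append((ops[i].name, ops[i + 1].name))
            i += 2
        else:
            groups.append((ops[i].name,))
            i += 1
    return groups
```

For example, "t1 := a + b" and "t2 := c * d" touch different variables and can be paired, while "t3 := t1 + t2" has to wait for both of them.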
vm3) Split the virtual machine's responsibilities into a kind of virtual 
chipset, consisting of a virtual cpu, a virtual bitblter, a virtual port 
manager and so on. For instance, we could decouple screen generation from what 
the virtual cpu is actually doing. The same goes for sound, communication and 
so on. Separate memory image spaces connected by some kind of bus could be the 
way to do it. This was done on the Amiga. For all of you who used one, do you 
remember how an XT with a similarly fast cpu compared? The difference was in 
the custom chips that freed the cpu from heavy work such as bitblitting, 
sound mixing and screen updating. This matters especially now, when screen updating can be 
so time consuming. The Amiga used the 68000 cpu family from 
Motorola. Those cpus are very interesting because, besides twin 
data-instruction buses, an orthogonal instruction set and a lot of registers, they allowed 
asynchronous port management. That is, the cpu instructs the serial port to send a 
byte and continues executing instructions regardless of what the serial port 
is doing in the meantime. This doesn't happen on all chips.
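To make the virtual chipset in vm3 a bit more concrete, here is a small Python sketch of a virtual cpu that posts drawing commands on a "bus" and carries on without waiting, while a separate virtual blitter thread does the pixel work. The names VirtualBlitter, virtual_cpu and demo, the command format and the list-of-lists frame buffer are all assumptions invented for the example, and the bus is simply a thread-safe queue.

```python
import queue
import threading


class VirtualBlitter(threading.Thread):
    """Hypothetical display processor: owns the frame buffer and the bus."""

    def __init__(self, width, height):
        super().__init__()
        self.bus = queue.Queue()  # commands arrive here from the virtual cpu
        self.framebuffer = [[0] * width for _ in range(height)]

    def run(self):
        while True:
            command = self.bus.get()
            if command is None:  # sentinel: shut the blitter down
                break
            x, y, w, h, colour = command
            for row in range(y, y + h):
                for col in range(x, x + w):
                    self.framebuffer[row][col] = colour


def virtual_cpu(blitter):
    """Post fill commands and keep computing without waiting for the screen."""
    blitter.bus.put((0, 0, 2, 2, 1))  # fill a 2x2 block at (0, 0) with colour 1
    total = sum(range(1000))          # unrelated work proceeds immediately
    blitter.bus.put((2, 1, 2, 1, 7))  # fill a 2x1 strip at (2, 1) with colour 7
    return total


def demo():
    blitter = VirtualBlitter(width=4, height=2)
    blitter.start()
    total = virtual_cpu(blitter)
    blitter.bus.put(None)
    blitter.join()
    return total, blitter.framebuffer
```

The only point of contact between the two sides is the queue, so the blitter could in principle be replaced by real hardware without touching the virtual cpu.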
Hint: this would allow building a real Smalltalk machine, by putting each 
virtual cpu on silicon.
Crazy? 1) How could several virtual cpus execute byte 
codes at the same time? Through something similar to Visual Smalltalk's SLLs? 
Maybe it's quite close to what an SLL is. Unmovable blocks of ram would be 
advisable for all this. For instance, for screen generation we could copy the 
AGP idea from the new Intel motherboards. Their idea was to remove the video memory 
from the video card and use some kind of fast/chip ram like what the Amiga 
does (and has done since who knows when). That is, one memory for programs and data, 
and another for screen, sound samples and so on. Considerable speedups could 
come from this, because we would be splitting the tasks of the virtual 
machine among virtual cpus that execute specific byte codes designed just for 
their task. This allows for independence of implementation, and also puts 
primitives where they belong: primitives for video implemented only in the video 
generating virtual cpu, primitives for communications implemented only in 
the communication virtual cpu, and so on. This also allows each virtual cpu to be 
optimized for the hardware it runs on.
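One way to picture "primitives where they belong" is a separate primitive table per virtual cpu. The Python sketch below is purely illustrative: the VirtualCpu class, the dispatch function, the primitive numbers and the two toy primitives are made up for the example and do not correspond to Squeak's primitive numbering.

```python
class VirtualCpu:
    """Hypothetical specialised processor with its own primitive table."""

    def __init__(self, name):
        self.name = name
        self.primitives = {}

    def primitive(self, number):
        """Decorator that registers a function as primitive `number` here."""
        def register(function):
            self.primitives[number] = function
            return function
        return register


video = VirtualCpu("video")
comms = VirtualCpu("comms")


@video.primitive(1)
def fill_rect(width, height):
    return width * height  # number of pixels that would be touched


@comms.primitive(1)
def send_bytes(payload):
    return len(payload)  # number of bytes that would be sent


def dispatch(cpus, cpu_name, number, *args):
    """Route a primitive call to the one virtual cpu that implements it."""
    cpu = cpus[cpu_name]
    if number not in cpu.primitives:
        raise LookupError(f"{cpu_name} has no primitive {number}")
    return cpu.primitives[number](*args)
```

Primitive 1 means something different on each virtual cpu here, which is the point: each table can be implemented and optimized on its own.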
Furthermore, each of those virtual cpus would be able to execute in 
parallel with all the others (except perhaps multiple "general purpose" virtual 
machines). What about a multiprocessor version of the GC? Execution speed could be 
improved if we let n virtual processors replace object pointers 
in the image simultaneously, provided that there is more than one real cpu working. But I 
think even this can be avoided. If a dedicated virtual cpu keeps track of which objects point to 
the new objects in the new object pool (this can be done by watching the other 
VMs, or by making the VMs tell this special cpu when they make something point 
to a new object), then when the GC is triggered, a list of just what needs to be 
changed could already be prepared.
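The "tell the special cpu when something points to a new object" variant is close to what is usually called a write barrier with a remembered set. Here is a toy Python sketch of it; the Obj and Heap classes and their methods are hypothetical, and the collector is deliberately incomplete: it only fixes up the recorded slots and does not trace objects reachable only from other new objects.

```python
class Obj:
    """Toy heap object; is_new marks membership in the new object pool."""

    def __init__(self, name, is_new):
        self.name = name
        self.is_new = is_new
        self.slots = {}


class Heap:
    def __init__(self):
        # (holder, slot) pairs where an old object was made to point at a new one
        self.remembered = []

    def store(self, holder, slot, value):
        """Every pointer store goes through here (the write barrier)."""
        holder.slots[slot] = value
        if value.is_new and not holder.is_new:
            self.remembered.append((holder, slot))

    def scavenge(self):
        """Move referenced new objects out of the pool, using only the list."""
        forwarding = {}  # id of a new object -> its single moved copy
        for holder, slot in self.remembered:
            target = holder.slots[slot]
            if target.is_new:  # the slot may have been overwritten since
                if id(target) not in forwarding:
                    moved = Obj(target.name, is_new=False)
                    moved.slots = dict(target.slots)
                    forwarding[id(target)] = moved
                holder.slots[slot] = forwarding[id(target)]
        self.remembered.clear()
        return len(forwarding)
```

When the GC trigger comes, scavenge never scans the old objects; it only visits the slots that were recorded as the stores happened.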
This is more or less what I had thought about. I just love it when 
manufacturers tell you just a few words about how they improved speed and make it sound as 
complicated as hell, when in fact those few lines let you rethink the problem and 
arrive at the same conclusions on your own. This happens all the time with 
Intel, for instance. It's also quite astonishing to see that there are not a 
lot of those ideas, and that they are not all used at once because, if they were, 
the manufacturers would not make as much money as they do now...
Andres.