Hi.
I am starting a prototyping effort to add native multi-threading support to Squeak. Any suggestions, hints, or advice would be appreciated.
Just a few days ago I started thinking about the same thing. My main motivation is that I have a two-processor computer. And if I had access to an n-processor machine and could run Squeak on it, I would want to take advantage of parallel processing.
First of all, I looked at the virtual machine, with the goal of making it able to run on multiple processors at once. I thought of the virtual machine as just another cpu of the kind I already knew, like the i80386 and its successors, the 68xxx and so on. Since I knew all those, I concluded that I could apply to the virtual machine the same tricks the people at Intel and Motorola came up with to improve performance on their processors. I thought it would be nice to implement the following things:
vm1) Split the virtual machine into synchronized core parts. The first two parts I thought of were fetch and execute. If one process were dedicated to looking ahead for byte codes while another one actually executed them, things would go faster (up to twice as fast or so, in the best case). This is roughly what Intel did with the pipelined 80486 and then on all the Pentiums, which added branch prediction under a name like Branch Target Buffer. They guess what a jump instruction will do, and tell the prefetch queue to fill itself as if the jump instruction had been executed in the predicted way. Here the picture becomes clearer, because in our case the time-consuming instructions are usually message sends. So just keeping track of the methods that are likely to be executed, before the virtual machine asks for them, should improve performance.
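To make the vm1 idea concrete, here is a minimal sketch in Python (not Squeak code). Everything in it is a hypothetical toy: `ToyClass`, `MethodCache`, the tuple-based byte codes and the per-send "predicted class" hint are assumptions made up for illustration. A fetcher thread scans ahead for sends and resolves the method lookup into a shared cache, so the executor finds the method already there. To keep the result deterministic, the sketch lets the fetcher finish before execution starts; a real VM would overlap the two.

```python
import threading

class ToyClass:
    """Hypothetical class object: a name, a superclass and a method table."""
    def __init__(self, name, superclass=None, methods=None):
        self.name = name
        self.superclass = superclass
        self.methods = methods or {}

def lookup(cls, selector):
    """Slow path: walk the superclass chain until the selector is found."""
    while cls is not None:
        if selector in cls.methods:
            return cls.methods[selector]
        cls = cls.superclass
    raise LookupError(selector)

class MethodCache:
    """Cache shared between the fetcher thread and the executor."""
    def __init__(self):
        self._lock = threading.Lock()
        self._entries = {}
        self.hits = 0
        self.misses = 0

    def prefetch(self, cls, selector):
        method = lookup(cls, selector)
        with self._lock:
            self._entries[(cls.name, selector)] = method

    def fetch(self, cls, selector):
        with self._lock:
            method = self._entries.get((cls.name, selector))
            if method is not None:
                self.hits += 1
                return method
            self.misses += 1
        method = lookup(cls, selector)  # misprediction: fall back to the slow path
        with self._lock:
            self._entries[(cls.name, selector)] = method
        return method

OBJECT = ToyClass("Object", methods={"printString": lambda r: str(r)})
NUMBER = ToyClass("Number", OBJECT, {"+": lambda r, a: r + a})
INTEGER = ToyClass("Integer", NUMBER, {"negated": lambda r: -r})

def class_of(value):
    return INTEGER if isinstance(value, int) else OBJECT

# Toy byte codes: ("push", value) or ("send", selector, argc, predicted class).
PROGRAM = [
    ("push", 3),
    ("push", 4),
    ("send", "+", 1, INTEGER),
    ("send", "negated", 0, INTEGER),
    ("send", "printString", 0, INTEGER),
]

def prefetch_sends(code, cache):
    """Fetcher: look ahead for sends and resolve them for the predicted class."""
    for op in code:
        if op[0] == "send":
            _, selector, _argc, predicted = op
            cache.prefetch(predicted, selector)

def execute(code, cache):
    """Executor: a tiny stack machine that asks the cache for each method."""
    stack = []
    for op in code:
        if op[0] == "push":
            stack.append(op[1])
        else:
            _, selector, argc, _predicted = op
            args = [stack.pop() for _ in range(argc)][::-1]
            receiver = stack.pop()
            method = cache.fetch(class_of(receiver), selector)
            stack.append(method(receiver, *args))
    return stack.pop()

def run(code):
    cache = MethodCache()
    fetcher = threading.Thread(target=prefetch_sends, args=(code, cache))
    fetcher.start()
    fetcher.join()  # simplification: the fetcher gets a full head start
    return execute(code, cache), cache.hits, cache.misses
```

With the predictions above, `run(PROGRAM)` finds all three sends already resolved. A wrong prediction only costs a normal lookup, which is the same trade-off a branch predictor makes.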
vm2) Optimize the instruction set. This idea is still rather underdeveloped. The fetcher could also translate the byte codes into some other representation that allows faster execution (actual native code, for instance). I don't know much about byte codes, but from what I've seen in compiled methods, pushes, pops and returns tend to occur in almost every method. We could gain extra performance if we made the virtual machine superscalar as well. To allow for that, we could look at whether consecutive byte codes are independent and, if they are, execute them both at the same time through instruction pipelines (real machine threads). This is done on all Pentiums and on a lot of other chips. It would be great to find a way of deciding whether two consecutive message sends are independent of each other or not.
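The independence test in vm2 can be sketched as follows, in Python rather than Smalltalk. This assumes the fetcher has already translated the stack byte codes into a register-like form where each instruction lists the variables it reads and writes; `Instr`, `independent` and `pair_up` are hypothetical names for this illustration. Two instructions are treated as independent when neither writes something the other reads or writes. For real message sends the read and write sets would be much harder to know, so this only shows the check itself.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Instr:
    """Hypothetical translated instruction with explicit read/write sets."""
    name: str
    reads: frozenset
    writes: frozenset

def independent(a, b):
    """True when running a and b together cannot change the result."""
    return not (
        a.writes & b.reads      # b needs a value that a produces
        or a.reads & b.writes   # b overwrites a value that a still needs
        or a.writes & b.writes  # both write the same variable
    )

def pair_up(instrs):
    """Greedily group consecutive independent instructions into slots of two."""
    slots = []
    i = 0
    while i < len(instrs):
        if i + 1 < len(instrs) and independent(instrs[i], instrs[i + 1]):
            slots.append((instrs[i], instrs[i + 1]))
            i += 2
        else:
            slots.append((instrs[i],))
            i += 1
    return slots

PROGRAM = [
    Instr("t1 := a + b", frozenset({"a", "b"}), frozenset({"t1"})),
    Instr("t2 := c * d", frozenset({"c", "d"}), frozenset({"t2"})),
    Instr("t3 := t1 - t2", frozenset({"t1", "t2"}), frozenset({"t3"})),
    Instr("a := t3", frozenset({"t3"}), frozenset({"a"})),
]

def slot_names(slots):
    return [tuple(instr.name for instr in slot) for slot in slots]
```

Here the first two instructions can share a slot, while the last two must run one after the other because `a := t3` reads what `t3 := t1 - t2` writes.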
vm3) Split the virtual machine's responsibilities into a kind of virtual chipset, consisting of a virtual cpu, a virtual blitter, a virtual port manager and so on. For instance, we could decouple screen generation from what the virtual cpu is actually doing. The same goes for sound, communication and so on. Some kind of separate memory image spaces and a kind of bus could be the way to do it. This was done on the Amiga. For those of you who used one, do you remember what an XT was like with a similarly fast cpu? The difference was in the custom chips that relieved the cpu of heavy work such as blitting, sound mixing and screen updating. That matters especially now, when screen updating can be so time consuming. To allow this, the Amiga used the 68000 cpu family from Motorola. Those cpus are very interesting because, besides twin data-instruction buses, an orthogonal instruction set and a lot of registers, they allowed asynchronous port management. That is, the cpu instructs the serial port to send a byte and continues executing instructions regardless of what the serial port is doing in the meantime. This doesn't happen on all chips.
Hint: this allows building a real Smalltalk virtual machine, putting each virtual cpu on silicon.
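The asynchronous port management described in vm3 might look roughly like this in Python. `VirtualSerialPort` and `run_cpu` are hypothetical names, and a queue plus a worker thread stand in for the "bus" and the device; this is a sketch of the decoupling only, not of any real Squeak primitive.

```python
import queue
import threading

class VirtualSerialPort:
    """Hypothetical device that runs in its own thread, fed through a queue."""
    def __init__(self):
        self._commands = queue.Queue()
        self.sent = []
        self._worker = threading.Thread(target=self._serve)
        self._worker.start()

    def send_byte(self, byte):
        """Called by the virtual cpu; returns at once without waiting for I/O."""
        self._commands.put(byte)

    def _serve(self):
        while True:
            byte = self._commands.get()
            if byte is None:        # shutdown marker
                break
            self.sent.append(byte)  # stand-in for the slow device work

    def shutdown(self):
        """Ask the device to drain its queue and stop."""
        self._commands.put(None)
        self._worker.join()

def run_cpu():
    """The 'cpu' keeps computing while the port handles bytes on its own."""
    port = VirtualSerialPort()
    total = 0
    for n in range(5):
        port.send_byte(n)  # hand the byte to the device...
        total += n * n     # ...and carry on with other work immediately
    port.shutdown()
    return total, port.sent
```

The same shape would apply to a virtual blitter or sound device: the cpu only enqueues a request, and the device decides when and how to carry it out.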
Crazy? 1) How would it be possible for several virtual cpus to execute byte codes at the same time? Through something similar to Visual Smalltalk's SLLs? Maybe it's quite close to what an SLL is. Unmovable blocks of ram would be advisable for all this. For instance, for screen generation we could copy the AGP idea on new Intel motherboards. Their idea was to remove the video memory from the video card and use some kind of fast/chip ram like what the Amiga does (and has done since who knows when). That is, one memory for programs and data, and another for screen, sound samples and so on. Considerable speedups could come from this, because we would be splitting the tasks of the virtual machine into virtual cpus that execute specific byte codes designed just for their task. This allows for independence of implementation, and also puts primitives where they belong: primitives for video implemented only in the video-generating virtual cpu, primitives for communications implemented only in the communication virtual cpu, and so on. It also allows flexible optimization for the hardware that is running the virtual cpus.
Furthermore, each of those virtual cpus would be able to execute in parallel with all the others (except perhaps multiple "general purpose" virtual machines). What about a multiprocessor version of the GC? Execution speed can be improved if we let n virtual processors replace object pointers simultaneously in the image, provided that there is more than one real cpu working. But I think this can be avoided. Suppose a dedicated virtual cpu keeps track of which objects point to the new objects in the new object pool (this can be done by tracking the other VMs, or by making the VMs tell this special cpu whenever they make something point to a new object). Then, when the GC is triggered, a list of just what needs to be changed could already be prepared.
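The "tell the special cpu whenever something points to a new object" idea can be sketched like this in Python. `Obj`, `RememberedSet`, `store` and `scavenge` are hypothetical names, and the two-space model (old objects and new objects, told apart by a flag) is an assumption for illustration only. Every pointer store goes through `store`, which records old-to-new references, so at GC time only the recorded slots, plus the slots of the objects being moved, have to be patched.

```python
class Obj:
    """Hypothetical heap object living in either the new pool or old space."""
    def __init__(self, name, is_new):
        self.name = name
        self.is_new = is_new
        self.fields = {}
        self.forward = None  # set once the object has been moved to old space

class RememberedSet:
    """List of (holder, field) slots in old objects that point to new objects."""
    def __init__(self):
        self.slots = []

    def record(self, holder, field):
        self.slots.append((holder, field))

def store(holder, field, value, remembered):
    """Write barrier: all pointer stores go through here."""
    holder.fields[field] = value
    if value.is_new and not holder.is_new:
        remembered.record(holder, field)

def scavenge(remembered):
    """Move reachable new objects to old space, patching only known slots."""
    worklist = list(remembered.slots)
    patched = []
    while worklist:
        holder, field = worklist.pop(0)
        target = holder.fields[field]
        if not target.is_new:
            continue  # slot was overwritten since it was recorded
        if target.forward is None:
            copy = Obj(target.name, is_new=False)
            copy.fields = dict(target.fields)
            target.forward = copy
            # The copy may still point to new objects; queue its slots too.
            worklist.extend((copy, f) for f in copy.fields)
        holder.fields[field] = target.forward
        patched.append((holder.name, field))
    remembered.slots.clear()
    return patched

def demo():
    remembered = RememberedSet()
    root, a = Obj("root", is_new=False), Obj("a", is_new=False)
    x, y = Obj("x", is_new=True), Obj("y", is_new=True)
    store(root, "left", x, remembered)   # old -> new: recorded
    store(x, "next", y, remembered)      # new -> new: not recorded
    store(a, "ref", x, remembered)       # old -> new: recorded
    store(a, "other", root, remembered)  # old -> old: not recorded
    recorded = len(remembered.slots)
    patched = scavenge(remembered)
    same_copy = root.fields["left"] is a.fields["ref"]
    return recorded, patched, same_copy
```

In `demo`, only two of the four stores are recorded, and the scavenge never has to scan `a.other` or any other old-to-old pointer.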
This is more or less what I had thought about. I just love it when manufacturers tell you just a few words about how they improved speed and paint it as complicated as hell, when in fact those few lines let you rethink the problem and arrive at the same conclusions on your own. This happens all the time with Intel, for instance. It's also quite astonishing to find that there are not a lot of those ideas, and that they are not all used at once because, if they were, the manufacturers would not make as much money as they do now...
Andres.
squeak-dev@lists.squeakfoundation.org