[Vm-dev] PICs (was: RISC-V J Extension)

Tue Jul 24 23:18:18 UTC 2018

While it is bad form to move a private discussion to (or back to) a
public forum, some of these links might be interesting to people here
and I have been unable to send emails to Tobias after my initial reply.
An attempt on Wednesday and on Friday made mcrelay.correio.biz complain
that mx00.emig.gmx.net refused to talk to it and an attempt from my old
1991 email account on Monday complained about the email address though
it was ok as far as I can tell.

Tobias wrote:
> Jecel wrote:
> > [new direction: emulate bytecodes and RISC-V]
> 
> That's an interesting take.
> 
> I can only watch from afar, but it's all interesting. (for example that guy
> who does a RISC-V CPU in TTL chips: https://www.youtube.com/channel/UCBcljXmuXPok9kT_VGA3adg )

It is an interesting project. I was annoyed by his claim to have the
first homebrew TTL 32 bit processor since in the late 1990s a group of
students at the MIT processor design course implemented the Beta
processor in TTLs instead of using FPGAs like all other groups (before
or since). Sadly, all information about this has been eliminated from
the web and can't even be found in archive.org.

I tried to get the local universities to teach RISC-V to their students
instead of their own educational RISC processors but they are too
emotionally attached to their designs.

> Sounds reasonable. Let's have them know dynamic languages are also still there ;)
> (I mean, you're very familiar with both Smalltalk and Self...)

Mario Wolczko has been involved in Java since the late 1990s but was
part of the Self group before that and had created the Mushroom
Smalltalk computer before that.

http://www.wolczko.com/

Boris Shingarov is currently involved with Java but has given a lot of
talks about Smalltalk VMs and was involved in Squeak back in the OS/2
days.

http://shingarov.com/

With me, that was 3 out of 6 people at the meeting representing the
Smalltalk viewpoint. We shall see if that will have any practical
effect.

> The TLB is somewhat maintained by the CPU to manage the translation of virtual addresses to physical ones.
> 
> I can imagine something similar, like a branch, that upon return, updates a field
> in a PIC buffer, such that the next time the branch is only taken if a register (eg, class of the object) 
> is different or so.

Ok, Mario actually mentioned that with today's advanced branch
prediction hardware we might want to re-evaluate PICs. In this case you
wouldn't be using the TLBs but the BTB (Branch Target Buffer) hardware.

https://www.slideshare.net/lerruby/like-2014214

Mario might have actually been thinking about Urs Hölzle's ECOOP 95
paper, which was a slightly different subject.

http://hoelzle.org/publications/ecoop95-dispatch.pdf

They were looking at the different kinds of software implementation of
method dispatch (not only PICs) and the effects of processors executing
more and more instructions per clock cycle. That might make a scheme
that is bad for a simple RISC (due to many tests, for example) actually
work well on an advanced out-of-order processor (due to the tests being
"free" since they execute in parallel with the main code). They didn't
look at branch prediction hardware, but it certainly would have a huge
impact. Several of the later papers focused on branch prediction:

http://hoelzle.org/publications.html
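
For anyone who has not looked at PICs in a while, here is a rough
software model of the "many tests" style of dispatch mentioned above.
It is only a sketch in Python, not code from any real VM: the SendSite
class, its fields and the Circle/Square classes are made up for
illustration. Each send site keeps its own list of (class, method)
pairs and tests them in order, so the cost on a simple processor grows
with the number of entries.

```python
class SendSite:
    """One call site with a polymorphic inline cache (hypothetical model)."""

    def __init__(self, selector):
        self.selector = selector
        self.entries = []   # (class, method) pairs, tested in order
        self.misses = 0     # how many times the full lookup was needed

    def send(self, receiver, *args):
        cls = type(receiver)
        # The chain of class tests: one comparison per cached entry.
        for cached_cls, method in self.entries:
            if cached_cls is cls:
                return method(receiver, *args)
        # Miss: do the full lookup once and extend the PIC.
        self.misses += 1
        method = getattr(cls, self.selector)
        self.entries.append((cls, method))
        return method(receiver, *args)

    def types_seen(self):
        # The PIC doubles as type feedback for this send site.
        return [cls for cls, _ in self.entries]


class Circle:
    def area(self):
        return 3


class Square:
    def area(self):
        return 4


site = SendSite("area")
results = [site.send(obj) for obj in (Circle(), Square(), Circle())]
```

After the three sends the site has two entries and the third send is a
hit on the first test. A hardware scheme based on the BTB would replace
the loop of tests, which this model does not attempt to show.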

> > For SiliconSqueak I actually had two different PIC instructions. They
> > modified how the instruction cache works. Normally the instruction cache
> > is accessed by hashing the 32 bit value of the PC except for the lowest
> > bits which select a byte in the cache line, but after a PIC instruction
> > the hash used a 64 bit value that combined the PC (all bits) and the
> > pointer to the receiver's class. The resulting cache line was fetched
> > and instructions executed in sequence even though the PC didn't change.
> > Any branch or call instruction would restart normal execution at the new
> > PC.
> 
> Sounds neat!
>
> > So a PIC entry takes up exactly one cache line. A PIC can have as many
> > entries as needed and the instruction takes the same time to execute no
> > matter how many entries there are (not taking into account cache
> > misses).
>
> Wow, that's incredible.
>
> > The second PIC instruction works exactly like the first but it supplies
> > a different value to be used in place of the current PC. That allows
> > different call sites to share PIC entries if needed, though that might
> > be more complicated than it is worth.
> 
> Maybe. What I like about PICs per send site is that you can essentially use them
> as data source for dynamic feedback (what "types" were actually seen at this send site?)
> and one probably would need some instructions to fetch those infos from the PIC.

One of the papers in that list is the 1997 technical report "The Space
Overhead of Customization". One of the reasons that Java won over Self
was that its simple interpreter ran on 8MB machines that most of Sun's
customers had while Self needed 24MB workstations which were rare (but
would be very common just two years later). Part of that was due to
compiling a new version of native code for every different type of
receiver even if the different versions didn't really help.

My idea of allowing PICs to be optionally shared was that this would
allow customization to be limited in certain cases to save memory. It
would cause a loss of information about types seen at a call site, but
that doesn't always have a great impact on performance.
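
To make the trade-off concrete, here is a toy model in Python of the two
PIC instructions. It is a sketch under heavy assumptions and not the
SiliconSqueak design itself: a dictionary stands in for the instruction
cache, the handlers stand in for the code in a cache line, and the names
HashedPICs, install, dispatch and shared_key are invented for this
example. Cache misses and the actual 64 bit hash are not modeled.

```python
class HashedPICs:
    """Toy model: PIC entries found by hashing (key, receiver class)."""

    def __init__(self):
        # (key, class) -> handler; each entry stands in for one cache line.
        self.lines = {}

    def install(self, key, cls, handler):
        self.lines[(key, cls)] = handler

    def dispatch(self, pc, receiver, shared_key=None):
        # First instruction form: the key is the call site's own PC.
        # Second form: the caller supplies a value used in place of the PC.
        key = pc if shared_key is None else shared_key
        # One lookup, no matter how many entries exist for this key.
        return self.lines[(key, type(receiver))](receiver)

    def types_seen(self, key):
        # Type feedback is only available per key, not per call site.
        return {cls for (k, cls) in self.lines if k == key}


pics = HashedPICs()

# Unshared: the site at 0x100 saw an int, the site at 0x200 saw a str.
pics.install(0x100, int, lambda r: "int@0x100")
pics.install(0x200, str, lambda r: "str@0x200")

# Shared: the sites at 0x300 and 0x400 both dispatch through key 0x900.
pics.install(0x900, int, lambda r: "int@shared")
pics.install(0x900, str, lambda r: "str@shared")

a = pics.dispatch(0x100, 7)
b = pics.dispatch(0x300, 7, shared_key=0x900)
c = pics.dispatch(0x400, "x", shared_key=0x900)
```

In this model the shared key holds two entries that serve both sites,
which is the memory saving, but asking what types were seen at 0x300
alone returns nothing: the information only exists for the shared key.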

-- Jecel