[Vm-dev] [Pharo-dev] Pony for Pharo VM: Pony papers

Shaping shaping at uurda.org
Tue Apr 21 07:19:27 UTC 2020





From: Pharo-dev [mailto:pharo-dev-bounces at lists.pharo.org] On Behalf Of Shaping
Sent: Tuesday, 21 April, 2020 01:05
To: 'Robert' <robert.withers at pm.me>; 'Open Smalltalk Virtual Machine Development Discussion' <vm-dev at lists.squeakfoundation.org>; 'Pharo Development List' <pharo-dev at lists.pharo.org>
Subject: Re: [Pharo-dev] [Vm-dev] Pony for Pharo VM



The Pony compiler and runtime need to be studied.

What better way than to bring the Pony compiler into Squeak? Build a Pony runtime inside Squeak, with the vm simulator. Build a VM. Then people will learn Pony and it would be great!


Yes, that is one way.  Then we can simulate the new collector with Smalltalk in the usual way, whilst also integrating ref-caps and dynamic types (the main challenge).  We already know that Orca works in Pony (in high-performance production—not an experiment or toy).  Still, there will be bugs and perhaps room for improvement.  Smalltalk simulation would help greatly there.  The simulated Pony-Orca (the term used in the Orca paper), or simulated Smalltalk-Orca if we can tag classes with ref-caps and keep Orca working, will run even more slowly in simulation mode with all that message-passing added to the mix.

The cost of message passing is reduced when using the CogVM JIT. It is indeed somewhat slower when running in the simulator. I think the objective should be to run the Pony bytecodes


Pony is a language, compiler and runtime.  The compiler converts Pony source to machine code.


 on the jitting CogVM. This VM allows you to install your own BytecodeEncoderSet. Note that I was definitely promoting a solution of running Pony on the CogVM, not Orca.


Pony is not a VM, either--no bytecodes.  We would be studying Orca structure in the Pony C/C++, how that fits with the ref-caps, and then determining how to write something similar in the VM, or how to work Smalltalk dynamic types into the existing Pony C/C++ (not nearly as fun, probably).


 I’m starting to study the Pharo VM.  Can someone suggest what to read?  I see what appears to be outdated VM-related material.  I’m not sure what to study (besides the source code) and what to ignore.  I’m especially interested to know what not to read.

I would suggest sticking to Squeak, instead of Pharo, as that is where the VM is designed & developed. 


How do Pharo’s and Squeak’s VMs differ?  I thought OpenSmalltalkVM was the common VM.  I also read something recently from Eliot that seemed to indicate a fork.  

I thought Pharo had the new tools, like GT, but I’m not sure.  I don’t follow Squeak anymore.  

Here are a couple of interesting blogs covering the CogVM [1][2] regarding VM documentation.


The problem is easy to understand.  It reduces to stop-the-world GCing in a large heap, and how to instead make many small, well-managed heaps, one per actor.  Orca does that already and demonstrates very high performance.  That’s what the Orca paper is about.
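To make that contrast concrete, here is a toy sketch (plain Python, invented numbers; nothing here is the Orca implementation) of why many small per-actor heaps shrink the worst-case pause compared with one shared heap:

```python
# Toy model: in a simple tracing collector, pause time grows with the
# number of objects the collector must scan. One shared heap is paused
# all at once; per-actor heaps are each paused briefly and separately.
class Heap:
    def __init__(self, name):
        self.name = name
        self.objects = []

    def alloc(self, obj):
        self.objects.append(obj)

    def collect(self):
        # Pause is modeled as proportional to heap size.
        pause = len(self.objects)
        self.objects = [o for o in self.objects if o.get("live")]
        return pause

# One big heap: a single collection pauses everything at once.
big = Heap("shared")
for i in range(1000):
    big.alloc({"live": i % 2 == 0})
assert big.collect() == 1000

# Many actor heaps: each collection pauses only its own actor, briefly.
actors = [Heap(f"actor-{i}") for i in range(100)]
for h in actors:
    for i in range(10):
        h.alloc({"live": i % 2 == 0})
pauses = [h.collect() for h in actors]
assert max(pauses) == 10  # worst single pause is 100x smaller
```

The total scanning work is the same; what changes is that no single pause stalls every actor in the program.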

The CogVM has a single heap, divided into what I believe are called "segments", which let it dynamically grow to gain new heap space.


Yeah—no, it won’t work.  Sympathies.  Empathies.




Read the thread above and watch the video to sharpen your imagination and mental model, somewhat, for how real object-oriented programs work at run-time.  The video details are fuzzy, but you can get a good feel for message flow.


This should have happened first in Smalltalk.  


The performance of the GC in the CogVM is demonstrated with this profiling result from running all Cryptography tests. Load Cryptography with this script, open the Test Runner, select the Cryptography tests, and click 'Run Profiled':

Installer ss
    project: 'Cryptography';
    install: 'ProCrypto-1-1-1';
    install: 'ProCryptoTests-1-1-1'.

Here are the profiling results.

 - 12467 tallies, 12696 msec.

13.8% {1752ms} RGSixtyFourBitRegister64>>loadFrom:
8.7% {1099ms} RGSixtyFourBitRegister64>>bitXor:
7.2% {911ms} RGSixtyFourBitRegister64>>+=
6.0% {763ms} SHA256Inlined64>>processBuffer
5.9% {751ms} RGThirtyTwoBitRegister64>>loadFrom:
4.2% {535ms} RGThirtyTwoBitRegister64>>+=
3.9% {496ms} Random>>nextBytes:into:startingAt:
3.5% {450ms} RGThirtyTwoBitRegister64>>bitXor:
3.4% {429ms} LargePositiveInteger(Integer)>>bitShift:
3.3% {413ms} [] SystemProgressMorph(Morph)>>updateDropShadowCache
3.0% {382ms} RGSixtyFourBitRegister64>>leftRotateBy:
2.2% {280ms} RGThirtyTwoBitRegister64>>leftRotateBy:
1.6% {201ms} Random>>generateStates
1.5% {188ms} SHA512p256(SHA512)>>processBuffer
1.5% {184ms} SHA256Test(TestCase)>>timeout:after:
1.4% {179ms} SHA1Inlined64>>processBuffer
1.4% {173ms} RGSixtyFourBitRegister64>>bitAnd:

    old            -16,777,216 bytes
    young        +18,039,800 bytes
    used        +1,262,584 bytes
    free        -18,039,800 bytes

    full            1 totalling 86 ms (0.68% uptime), avg 86 ms
    incr            307 totalling 81 ms (0.6% uptime), avg 0.3 ms
    tenures        7,249 (avg 0 GCs/tenure)
    root table    0 overflows

As shown, 1 full GC occurred in 86 ms


Not acceptable.  Too long.  


and 307 incremental GCs occurred for a total of 81 ms. All of this GC activity occurred within a profile run lasting 12.7 seconds. The total GC time is just 1.31% of the total time. Very fast.


Not acceptable.  Too long.  And, worse, it won’t scale.  The problem is not the percentage; it’s the big delays amidst other domain-specific computation.  These times must be much smaller and spread out across many pauses during domain-specific computations.   No serious real-time apps can be made in this case.


I suggest studying the Pony and Orca material, if the video and accompanying explanation don’t clarify Pony-Orca speed and scale.  


 The solution for Smalltalk is more complicated, and will involve a concurrent collector.  The best one I can find now is Orca.  If you know a better one, please share your facts.


As different event loops on different cores will use the same 


externalizing remote interface


This idea is not clear.  Is there a description of it?

So I gather that the Orca/Pony solution does not treat inter-actor messages within the same process as remote calls?


Why would the idea of ‘remote’ enter here?  The execution scope is an OS process.  Pony actors run on their respective threads in one OS process.  Message passing is zero-copy; all “passing” is done by reference.  No data is actually copied.  The scheduler interleaves all threads needing to share a core if there are more actors than cores.  Switching time for actor threads, in that case, is 5 to 15 ns.  This was mentioned before.  Opportunistic work stealing happens.  That means that all the cores stay as busy as possible if there is any work at all left to do.  All of this happens by design without intervention or thought from the programmer.  You can read about this in the links given earlier.  I suggest we copy the design for Smalltalk.
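A rough sketch of that design, in Python rather than the runtime's C (all names invented; the real Pony scheduler uses one queue per scheduler thread and much subtler work stealing): per-actor mailboxes that hold references rather than copies, and an idle core that steals actors from a busier queue:

```python
# Illustrative only, not the Pony runtime: actors own a private mailbox,
# sends enqueue a reference (no copying), and idle "cores" steal work.
from collections import deque

class Actor:
    def __init__(self, name):
        self.name = name
        self.mailbox = deque()  # per-actor message queue
        self.received = []

    def send(self, msg):
        # Zero-copy: only the reference to msg is enqueued.
        self.mailbox.append(msg)

    def step(self):
        # Process one message; an actor is single-threaded by definition.
        if self.mailbox:
            self.received.append(self.mailbox.popleft())

class Scheduler:
    """Round-robin over per-core queues, with naive work stealing."""
    def __init__(self, cores=2):
        self.queues = [deque() for _ in range(cores)]

    def spawn(self, actor, core=0):
        self.queues[core].append(actor)

    def run(self):
        while any(a.mailbox for q in self.queues for a in q):
            for q in self.queues:
                if not q:  # idle core steals an actor from a busier queue
                    donor = max(self.queues, key=len)
                    if donor:
                        q.append(donor.pop())
                if q:
                    actor = q.popleft()
                    actor.step()
                    q.append(actor)

sched = Scheduler(cores=2)
a, b = Actor("a"), Actor("b")
sched.spawn(a, 0); sched.spawn(b, 0)  # both start on core 0
big_msg = {"payload": list(range(1000))}
a.send(big_msg); b.send(big_msg)      # same object, passed by reference
sched.run()
assert a.received[0] is b.received[0]  # no copy was made
```

The point of the sketch is only the shape: senders never block, no data is duplicated, and load-balancing happens in the scheduler without programmer intervention.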


If each core has a separate thread and thus a separate event loop, it makes sense to have references to actors in other event loops as a remote actor. Thus the parallelism is well defined.



to reach other event loops, we do not need a runtime that can run on all of those cores. We just need to start the minimal image on the CogVM with remote capabilities


Pony doesn’t yet have machine-node remoteness.  The networked version is being planned, but is a ways off still.  By remote, do you mean:  another machine or another OS/CogVM process on the same machine?

Yes, I mean both. I also mean between two event loops within the same process, different threads.

I think the Pony runtime still creates by default just one OS process per app and as many threads as needed, with each actor having only one thread of execution by definition of what an actor is (single-threaded, very simple, very small).  A scheduler keeps all cores busy, running and interleaving all the current actor threads.  Message tracing maintains ref counts.  A cycle-detector keeps things tidy.  Do Squeak and Pharo have those abilities?



to share workload.


With Pony-Orca, sharing of the workload doesn’t need to be managed by the programmer.

When I said sharing of workload is a primary challenge, I did not mean explicitly managing concurrency; the event loop ensures concurrency safety. I meant that the design of a parallelized application into concurrent actors is the challenge,


If you can write a state-machine with actors that each do one very simple, preferably reusable thing in response to received async messages, then it’s not a challenge.  We do have to learn how to do it.  It’s not what most of us are used to.  Pony is a good tool for practicing, even if the syntax is not interesting.  Still, as mentioned, we should make tools to help with that state-machine construction.  That comes later, but it must happen.


Pony has Actors.  It also has Classes.  The actors have behaviours.  Think of these as async methods.  Smalltalk would need new syntax for Actors, behaviours, and the ref-caps that type the objects.  Doing this last bit well is the task that concerns me most.  
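For illustration only, here is how an actor whose behaviours drive a small state-machine might be sketched in Python (Pony would declare this with `actor` and `be`; the mailbox-draining below is what the runtime's scheduler would do for you):

```python
# Illustrative only: a state-machine actor whose "behaviours" are async
# methods driven by a mailbox, after the Pony actor/be pattern.
from collections import deque

class LightActor:
    STATES = ("red", "green", "yellow")

    def __init__(self):
        self.state = "red"
        self.mailbox = deque()

    # In Pony this would be `be advance()`: an asynchronous behaviour.
    # Calling it only enqueues a message; nothing runs synchronously.
    def advance(self):
        self.mailbox.append("advance")

    def _behave(self, msg):
        # Synchronous code, like a Pony function or a Smalltalk method.
        if msg == "advance":
            i = self.STATES.index(self.state)
            self.state = self.STATES[(i + 1) % len(self.STATES)]

    def drain(self):
        # The runtime's scheduler, not the programmer, would drive this.
        while self.mailbox:
            self._behave(self.mailbox.popleft())

light = LightActor()
light.advance(); light.advance()
light.drain()
assert light.state == "yellow"
```

Note the split the text describes: behaviours are asynchronous entry points, while ordinary functions run the synchronous state transitions.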


that exists for Smalltalk capabilities and Pony capabilities. In fact, instead of talking about actors, concurrency & parallel applications, I prefer to speak of a capabilities model, inherently on an event loop, which is the focal point for safe concurrency.


I suggest a study of the Pony scheduler.  There are actors, mailboxes, message queues, and the scheduler, mainly.   You don’t need to be concerned about safety.  It’s been handled for you by the runtime and ref-caps.


 That’s one of the basic reasons for the existence of Pony-Orca.  The Pony-Orca dev writes his actors, and they run automatically load-balanced, via the actor-thread scheduler and work-stealing, when possible, on all the cores.  Making Smalltalk work with Orca is, at this early stage, about understanding how Orca works (study the C++ and program in Pony) and how to implement it, if possible, in a Smalltalk simulator.  Concerning Orca in particular, if you notice at the end of the paper, they tested Orca against the Erlang VM, C4, and G1, and it performed much better than all three.

I suppose it should be measured against the CogVM, to know for sure whether the single large heap is a performance bottleneck as compared to Pony/Orca performance with tiny per-actor heaps.


I don’t have time for Pony programming these days--I can’t even read about it these days.  Go ahead if you wish.


Your time is better spent in other ways, though.  The speed and scale advantages of Orca over the big-heap approach have been demonstrated.  That was done some time ago.  Read the paper by Clebsch and friends for details.  Read Wallaroo Labs’ field-experience reports from preparing to use Pony.  Or better, learn to write a Pony program.  If your resources don’t allow that, chat with Rocco Bowling (link above).  Everyone on Pony Zulip is very helpful and super-enthusiastic about Pony—and it doesn’t even have its own debugger, the last time I checked.  The tooling is poor, and people still love this thing.  Odd.


The biggest challenge, I think you would agree, is the system/application design that provides the opportunities to take advantage of parallelism. It kinda fits the microservices arch. So, we would run 64 instances of Squeak to take the multicore to town.


No, that’s much slower.  Squeak/Pharo still has the basic threading handicap:  a single large heap.

In my proposal, with 64 separate squeak processes running across 64 cores, there will be 64 heaps,


That would be too few actors, in general.  We are not thinking on the same scale for speed and actor-count.  

Expect actor counts to scale into the thousands or tens of thousands.  There are about 100 in the app above.   


1 per process. There will be a finite number of Capability actors in each event loop. This finite set of actors within one event loop will be GC-able by the global collector, full & incremental. As all inter-event loop interaction occurs through remote message passing, the differences between inter-vat (a vat is the event loop) communication within one process (create two local Vats), inter-vat communication between event-loops in different processes on the same machine and inter-vat communication between event-loops in different processes on different machines are all modeled exactly the same: remote event loops. 

 Here’s the gist of the problem again:  the big heap will not work and must go away, if we are to have extreme speed and a generalized multithreading programming solution.  

I am not convinced of this.

You must read others’ measurements, or write your own programs and do the tests to get those measurements.  Read about the measurements made in the academic paper I cited.  That’s the easy way.  You can also read the one from Sebastian Blessing from 2013:  https://www.ponylang.io/media/papers/a_string_of_ponies.pdf


 My current understanding is that Pony-Orca (or Smalltalk-Orca) starts one OS process, and then spawns threads, as new actors begin working.  You don’t need to do anything special as a programmer to make that happen.  You just write the actors, keep them small, use the ref-caps correctly so that the program compiles (the ref-caps must also be applied to Smalltalk classes), and organize your synchronous code into classes, as usual.  Functions run synchronous code.  Behaviours run asynchronous code.

My point was "writing the actors" and "organizing your synchronous code into classes" are challenging in the sense of choosing what is asynchronous and what is synchronous.


Yup, but only for a while.  Then you get used to it, and can’t imagine anything different, like not having a big heap.


 The parallel design space holds primacy.


No, strictly, the state-machine design does.  The parallelization is done for you.  


You’re not parallelizing anything.  That’s not your job.  (What a relief, yes?)  You’re an application programmer.  You’re writing a state-machine for your app, and distributing its work across specialized actors, which you code and whose async messages to each other change object data slots (wherever they happen to live—which need not concern you), and thus change the state of the state-machine you designed.  


You can’t use the multicore hardware you already own, or the goodness in the Orca and ref-cap design, if you can’t write a state-machine and use actors, or don’t have a tool to help you do that.  Most of us will want to use such a tool even if we are fluent at state-machine design.  This doesn’t even exist in Pony.  It’s very raw over there, but you get used to the patterns, as with any new strategy.  Still, I want a tool.  Don’t you?


Two tasks:  1) build tools to help us make state-machines in a reliable, pleasant way, so that we feel compelled and happy to do it; and 2) implement Pony-style scheduling, ref-caps, and Orca memory management in Smalltalk.


 The issue is not whether to use Pony.  I don’t like Pony, the language; it’s okay, even very good, but it’s not Smalltalk.  I like Smalltalk, whose concurrency model is painfully lame.

Squeak concurrency model.

Installer ss
    project: 'Cryptography';
    install: 'CapabilitiesLocal'

What abilities does the above install give Squeak?

This installs a local-only (no remote capabilities) capabilities model that attempts to implement the following in Squeak: the E-Rights capabilities model. [3] This also ensures inter-actor concurrency safety.

So your use of Pony is purely to access the Orca vm?


Orca is not a VM; it’s a garbage collection protocol for actor-based systems.  


I suggest using Pony-Orca to learn how Orca works, and then replace the Pony part of Pony-Orca with Smalltalk (dynamic typing), keeping the ref-caps (because they provide the guarantees).  I realize that this is a big undertaking.  Or:  write a new implementation of Orca in Smalltalk for the VM.  This is currently second choice, but that could change.


I think you will find the CogVM quite interesting and performant. 


--Not with its current architecture.


If the CogVM is not able to:

1) dynamically schedule unlimited actor-threads on all cores

Why not separate actor event-loop processes on each core, communicating remotely? [4][5]


--Because it will continue the current Smalltalk-concurrency lameness.  It’s a patch.  And still it will not allow the system to scale.  The concurrency problem has been solved nearly optimally and at high resolution in the current Pony-Orca.  There’s room for improvement, but it’s already in a completely different performance league compared to any big-heap Smalltalk.  If I’m to work hard on an implementation of this design for Smalltalk, I need a much greater speed-up and scaling ability than what these patches give.  


2) automatically load-balance

Use of mobility with actors would allow for automated rebalancing.

Speed hit.

Too slow/wasteful.  Moving an actor isn’t needed if each has its own heap.


3) support actor-based programs innately

With this code, asynchronous computation of "number eventual * 100" occurs in an event loop and resolves the promise 

[:number | number eventual * 100] value: 0.03 "returning an unresolved promise until the async computation completes and resolves the promise"
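A hedged Python analogue of that eventual-send (not the E/Squeak API; `eventual` and the class names here are purely illustrative): the call returns an unresolved promise immediately, and the event loop later runs the computation and resolves it:

```python
# Illustrative sketch of eventual-send with promises on an event loop.
from collections import deque

class Promise:
    def __init__(self):
        self.resolved = False
        self.value = None

    def resolve(self, value):
        self.resolved = True
        self.value = value

class EventLoop:
    def __init__(self):
        self.pending = deque()

    def eventual(self, fn, arg):
        p = Promise()
        self.pending.append((p, fn, arg))  # queued, not run yet
        return p                           # caller gets the promise now

    def run(self):
        # One "turn" per pending activation; no blocking anywhere.
        while self.pending:
            p, fn, arg = self.pending.popleft()
            p.resolve(fn(arg))

loop = EventLoop()
p = loop.eventual(lambda number: number * 100, 0.03)
assert not p.resolved  # still an unresolved promise before the turn
loop.run()
assert p.resolved and abs(p.value - 3.0) < 1e-9
```

The caller never waits: it holds the unresolved promise and continues, exactly as the quoted Smalltalk comment describes.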


Promises and notifications are fine.  Both happen in Pony-Orca.  But the promises don’t fix the big performance problems.

Am I wrong to state that this model allows innate support to actors? Or were you somehow stating that the VM would need innate support? Why does the VM have to know?

It’s not enough.  We still have the big pauses from GCs in a large heap.

4) guarantee no data-races

The issue to observe is whether computations are long-running and livelock the event loop, preventing it from handling other activations. This is a shared issue, as Pony/Orca is also susceptible to it.


Yes, and a dedicated cycle-detecting actor watches for this in Pony-Orca.  


E-Rights’ event loops ensure no data races, as long as actor objects are not accessible from more than one event-loop.


Speed hit.

No blocking and no write barriers exist in Pony-Orca.  You can’t wait.  If you need to “wait,” you set a timer and respond to the event when the timer fires.    


Imagine a cloud based compute engine, processing Cassandra events that uses inter-machine actors to process the massively parallel Cassandra database. Inter-thread communication is not sufficient as there are hundreds of separate nodes.


Yes; I didn’t claim otherwise.  The networked version is coming.  See above.   My point is that the ‘remote’ characterization is not needed.  It’s not helping us describe and understand. 


Design wise, it makes much sense to treat inter-thread, inter-process and inter-machine concurrency as the same remote interface.


No new design is needed for concurrency and interfacing.  There is much to implement, however.


The design is already done, modulo the not-yet-present network extension.  Interfacing between actors is always by async messaging.  Messaging will work as transparently as possible in the networked version across machine nodes.  


The issue is how most efficiently to use Orca, which happens to be working in Pony.  Pony is in production in two internal, speed-demanding, banking apps and in Wallaroo Labs’ high-rate streaming product.  Pony is a convenient way to study and use a working implementation of Orca.  Ergo, use Pony, even if we only study it as a good example of how to use Orca.  Some tweaks (probably a lot of them) could allow use of dynamic types.  We could roll our own implementation of Orca for the current Pharo VM, but that seems like more work than tweaking a working Pony compiler and runtime.  I’m not sure about that.  You know the VM better than I.  (I was beginning my study of the Pharo/OpenSmalltalkVM when I found Pony.)

Sounds like you might regret your choice and took the wrong path. 

I don’t see how you form that conclusion.  I’ve not chosen yet.

You stated you are not thrilled with using Pony.


I don’t like the Pony language syntax.  I don’t like anything that looks like Algol-60.  Pony is a language, compiler, and runtime implementing Orca.  The other stuff is good.  And I’ve not had much time to use it; I suspect I could like it more.



If most of what Squeak/Pharo offers is pleasant/productive VM simulation, much work still remains to achieve even a basic actor system and collector, but the writing of VM code in Smalltalk and compiling it to C may be much more productive than writing C++.  The C++ for the Pony compiler and runtime, however, already compiles and works well.  Thus, starting the work in C++ is somewhat tempting.  Can someone explain the limits of how the VM simulator can be used?  How much VM core C is not a part of what can be compiled from Smalltalk?  Can all VM C code be compiled from Smalltalk?


Can someone answer the above question?

[1] Cog Blog - http://www.mirandabanda.org/cogblog/
[2] Smalltalk, Tips 'n Tricks - https://clementbera.wordpress.com/
[3] Capability Computation - http://erights.org/elib/capability/index.html
[4] Concurrency (Event Loops) - http://erights.org/elib/concurrency/index.html
[5] Distributed Programming - http://erights.org/elib/distrib/index.html



