[Vm-dev] [Pharo-dev] Pony for Pharo VM

Shaping shaping at uurda.org
Fri May 8 05:06:42 UTC 2020


How do Pharo’s and Squeak’s VMs differ?  I thought OpenSmalltalkVM was the common VM.  I also read something recently from Eliot that seemed to indicate a fork.  

I thought Pharo had the new tools, like GT, but I’m not sure.  I don’t follow Squeak anymore.  

Pharo may have them; it is fast-moving and they drop historical support as new tools come online. I don't follow Pharo anymore. There is a common VM, but the builds are separate.

 

I mostly don’t follow Squeak, but do follow Pharo on and off, and may port as soon as the GUI formatting problems are fixed or are fixable by my own use of Spec2.   

 

I tried Squeak 5.3 a few days ago for the first time in 16 years.  It has a nicer induction/setup process, but menus were malfunctioning (rendering erratically) before I finished getting reacquainted with the new surface.  I don’t have time these days to finish playing with it.  I may get back to it, but why do that if the Pharo GUI is more advanced?  Besides avoiding Pharo framework bloat and confusion, what about Squeak compels you to use it instead of Pharo for VM dev?

 

 

https://ponylang.zulipchat.com/#narrow/search/lzip

That was the thread reference I was unable to follow. You provided it in the context of a discussion of why the CogVM was "not acceptable". Say what? So we are adding a near-real-time requirement? I would suggest that the CogVM meets near-real-time requirements. The longest GC pause may be, let us say, 100 ms. That is still near real-time.

 

5 to 10 ms is what I need.  Even Pony’s GCing barely keeps up with this, but the effect is smoother because all actors are running all the time, except when each actor GCs its own little heap.  The timing issue is more about smoothness and predictably small tail latencies at the right end of a very steep and narrow latency distribution.





 

Read the thread above and watch the video to sharpen your imagination and mental model, somewhat, for how real object-oriented programs work at run-time.  The video details are fuzzy, but you can get a good feel for message flow.

Exactly the way Smalltalk operates at runtime. Smalltalk was built on the insight that its core message-passing paradigm is the exact model of interaction we see remotely: message passing. Squeak is a native message-passing machine.



 

This should have happened first in Smalltalk.  

It did.

 

No, the idea of asynchronous messaging did, but not the implementation.   We’re not discussing the same phenomenon.

 

Smalltalk does not in general have asynchronous messaging between all actors all the time.  It doesn’t even have actors by default in the core language. You have to design them as an afterthought.  That’s just wrong.

 

Smalltalk does not have true actors as a baseline implementation of the OO programming paradigm.  You have to engineer it if you want it, and it doesn’t scale well with the green threads.  Just non-blocking FFI doesn’t count; that is necessary and good, but not sufficient.

 

Async messaging:  that was the original vision, and it still works best for state-machine construction, because no blocking and no read/write-barriers are needed.  Here’s the gist, and you’ll find that Kay says the same in his talks, repeatedly (apparently no one listens and thinks about using it):  advance the state of the state-machine only by exchange of asynchronous messages between actors.  That’s the whole thing.  Then you have the tooling on top of that to make the SM building systematic, reliable, and pleasant.  That’s missing too and must also be built, or the core idea is hard to use fully, which is largely why we program as we do today.  Most programmers are old dogs coding in the same old wrong way, because the new way, which is very much better, is even harder to do without the right tools and guarantees, and we don’t have those yet.  Functional programming (great for many domains) is a much better choice for general-purpose programming than the actor model with the current actor-based tools.
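A minimal sketch of that idea in Pony syntax (the names here are purely illustrative, not from any of the referenced material): the actor’s state advances only inside behaviours, which are processed one message at a time, so no blocking and no barriers are needed.

actor Door
  var _open: Bool = false

  be open(observer: Main) =>
    // State changes only while a message is being processed; nothing else can touch _open.
    _open = true
    observer.notify("opened")

  be close(observer: Main) =>
    _open = false
    observer.notify("closed")

actor Main
  let _env: Env

  new create(env: Env) =>
    _env = env
    let door = Door
    door.open(this)    // asynchronous sends; Main never blocks here
    door.close(this)

  be notify(event: String) =>
    _env.out.print(event)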





 

The performance of the GC in the CogVM is demonstrated with this profiling result from running all Cryptography tests. Load Cryptography with this script, open the Test Runner, select the Cryptography tests, and click 'Run Profiled':

Installer ss
    project: 'Cryptography';
    install: 'ProCrypto-1-1-1';
    install: 'ProCryptoTests-1-1-1'.

Here are the profiling results.

 - 12467 tallies, 12696 msec.

**Leaves**
13.8% {1752ms} RGSixtyFourBitRegister64>>loadFrom:
8.7% {1099ms} RGSixtyFourBitRegister64>>bitXor:
7.2% {911ms} RGSixtyFourBitRegister64>>+=
6.0% {763ms} SHA256Inlined64>>processBuffer
5.9% {751ms} RGThirtyTwoBitRegister64>>loadFrom:
4.2% {535ms} RGThirtyTwoBitRegister64>>+=
3.9% {496ms} Random>>nextBytes:into:startingAt:
3.5% {450ms} RGThirtyTwoBitRegister64>>bitXor:
3.4% {429ms} LargePositiveInteger(Integer)>>bitShift:
3.3% {413ms} [] SystemProgressMorph(Morph)>>updateDropShadowCache
3.0% {382ms} RGSixtyFourBitRegister64>>leftRotateBy:
2.2% {280ms} RGThirtyTwoBitRegister64>>leftRotateBy:
1.6% {201ms} Random>>generateStates
1.5% {188ms} SHA512p256(SHA512)>>processBuffer
1.5% {184ms} SHA256Test(TestCase)>>timeout:after:
1.4% {179ms} SHA1Inlined64>>processBuffer
1.4% {173ms} RGSixtyFourBitRegister64>>bitAnd:

**Memory**
    old            -16,777,216 bytes
    young        +18,039,800 bytes
    used        +1,262,584 bytes
    free        -18,039,800 bytes

**GCs**
    full            1 totalling 86 ms (0.68% uptime), avg 86 ms
    incr            307 totalling 81 ms (0.6% uptime), avg 0.3 ms
    tenures        7,249 (avg 0 GCs/tenure)
    root table    0 overflows

As shown, 1 full GC occurred in 86 ms...

 

Not acceptable.  Too long.

What is your near real-time requirement?



5 to 10 ms pauses per actor, not globally whilst all actors wait.  Think smooth.

 

...and 307 incremental GCs occurred for a total of 81 ms. All of this GC activity occurred within a profile run lasting 12.7 seconds. The total GC time is just 1.31% of the total time. Very fast.

 

Not acceptable.  Too long.  And, worse, it won’t scale.

I am unaware of any scaling problems. In networking, 1000s of concurrent connections are supported; in computations, 10,000s of objects. What are your timing requirements? Each incremental GC took a fraction of a millisecond to compute: 264 microseconds.



 

 The problem is not the percentage; it’s the big delays amidst other domain-specific computation.  These times must be much smaller and spread out across many pauses during domain-specific computations.

See the 307 incremental GCs? These are 264 microsecond delays spread out across domain-specific computations. 

 

We have to watch definitions and constraints carefully.  

 

Where memory management is concerned, this thread tries to compare the merits of the two extremes:  per-actor memory management, as in Orca (practically speaking, Pony), and global stop-the-world (StW) collection, as in a classical Smalltalk.

 

You seem to be presenting something intermediate above, where there are segments that are GCed in turn.  Are you stopping all domain threads/messaging during the incremental GCs, or just the ones for objects in a certain section of the heap? Or, are the heap partitions divided by more traditional criteria, like object size and lifespan?  What is the spacing between the incremental GCs?  Steady frequency of pauses, smallness of pauses, and narrowness of distribution of longest pauses, especially, are the most important criteria.  

 

   No serious real-time apps can be made in this case.

Of course they can. Model the domain as resilient and accepting of 100 ms pauses for full GCs. It may be that more could be done to the CogVM for near real-time; I am not very knowledgeable about the VM.



We are discussing different app domains.

 

I can’t use 100 ms pauses in my real-time app.  I need sub-10 ms pauses.  Again, even Pony needs better GCing, but the report I have from Rocco shows more or less acceptable pauses.  He was kind enough to run the program again with the GC stats turned on.  I’ll try to find the document and attach it.

 

I suggest studying the Pony and Orca material, if the video and accompanying explanation don’t clarify Pony-Orca speed and scale. 

Yeah, the video did not suggest anything other than using message passing. 

 

You can’t miss this detail; it’s almost everything that matters:  asynchronous message passing between all actors, all the time, on the metal, not as an afterthought, with a guarantee of no data-races.

 

I could not find the thread discussing GC. Would you please post the specific URL for that resource? I do not want to guess any longer.



I don’t get it.  It works for me in a newly opened tab.  I don’t know what the problem is.  Zulip should work for you as it does for me.  Log in to Zulip and search on ‘Rocco’.  You’ll see Rocco’s stuff, which is largely about the parsing app he has.  Better, just ask for the info you want.

 

 

Why would the idea of ‘remote’ enter here?  The execution scope is an OS process.  Pony actors run on their respective threads in one OS process.  Message passing is zero-copy; all “passing” is done by reference.  No data is actually copied.

In the SqueakELib capabilities model, between Vats (in-process, inter-process & inter-machine-node), most references to Actors are remote, and there we have zero-copy. Sometimes we need to pass numbers/strings/collections, and those are pass-by-copy.

 

Yeah, the copying won’t work well.  Just never do it, at least not in one OS process.  Zero-copy messaging is the rule in Pony.  Even this may not matter much eventually.  The MM is being reworked with the hope of eliminating all consensus-driven messaging, which can be expensive, even if only transiently, on highly mutable object sets.  See the Verona project, which seems to be slowly converging with Pony:  https://github.com/microsoft/verona/blob/master/docs/faq.md.  The core weakness of Orca memory management is still the fact that all messaging must be traced so that the collector knows which actors still refer to a given object.  This is true of all concurrent collectors.  That problem (and it’s a big problem) goes away completely if the Pony runtime becomes Verona-ized.
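For concreteness, a small hypothetical Pony sketch of that zero-copy hand-off: the sender gives up its iso reference with consume, so only the reference moves between heaps, and the compiler forbids any further use on the sending side.

class Packet
  var payload: String = ""

actor Consumer
  be take(p: Packet iso) =>
    // This actor now holds the only mutable alias; no bytes were copied.
    p.payload = "handled"

actor Main
  new create(env: Env) =>
    let p: Packet iso = recover iso Packet end
    let c = Consumer
    c.take(consume p)
    // p is unusable from here on; the compiler rejects any later access.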

 

 

  The scheduler interleaves all threads needing to share a core if there are more actors than cores.  Switching time for actor threads, in that case, is 5 to 15 ns.  This was mentioned before.  Opportunistic work stealing happens.  That means that all the cores stay as busy as possible if there is any work at all left to do.  All of this happens by design without intervention or thought from the programmer.  You can read about this in the links given earlier.  I suggest we copy the design for Smalltalk.

Which specific links? Could you send a summary email?

 

Not now, maybe later. 

 

Some of this data comes from lectures by Clebsch, but you’ll find most of the meat in the Pony papers, for which links can be found on the community page:  https://www.ponylang.io/community/ .  

 

 

I think the Pony runtime is still creating by default just one OS process per app and as many threads as needed, with each actor having only one thread of execution by definition of what an actor is (single-threaded, very simple, very small).  A scheduler keeps all cores busy, running and interleaving all the current actor threads.  Message tracing maintains ref counts.  A cycle-detector keeps things tidy.  Do Squeak and Pharo have those abilities?

In the case of remote capability references, there is reference counting. This occurs inside the Scope object, where there are 6 tables: 2 for third-party introduction (gift tables), 2 for outgoing references (#answers & #export) and 2 for incoming references (#questions & #imports). These tables manage all the remote reference counting, once again between any two Vats (in-process, inter-process & inter-machine-node). There are 2 GC messages sent back from a remote node (GCAnswer & GCExport) for each of the outgoing references. Say Alice has a reference to a remote object in Bob; when the internal references to Alice's reference end and the RemoteERef is to be garbage collected, a GC message is sent to the hosting Vat, Bob.

 

In Squeak/Pharo do all actors stop for the GCs, even for the smaller incremental ones?





I have some experience with state-machine construction to implement security protocols. In Squeak, do a DoIt on this script to load Crypto, ParrotTalk and SSL (currently broken) and see some state machines:

Installer ss
    project: 'Cryptography'; install: 'ProCrypto-1-1-1';
    project: 'Cryptography'; install: 'ProCryptoTests-1-1-1';
    project: 'Cryptography'; install: 'CapabilitiesLocal';
    project: 'Oceanside'; install: 'ston-config-map';
    project: 'Cryptography'; install: 'SSLLoader';
    project: 'Cryptography'; install: 'Raven'.

Pony has Actors.  It also has Classes.  The actors have behaviours.  Think of these as async methods.  Smalltalk would need new syntax for Actors, behaviours, and the ref-caps that type the objects.  Doing this last bit well is the task that concerns me most.
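A rough, hypothetical illustration of that split in Pony: classes carry ordinary synchronous methods (fun), while actors additionally carry behaviours (be), which are queued in the receiver’s mailbox and return to the sender immediately.

class Counter
  var _n: U64 = 0

  fun ref bump(): U64 =>
    // Synchronous method: runs on the caller's thread of execution.
    _n = _n + 1
    _n

actor Logger
  let _env: Env

  new create(env: Env) =>
    _env = env

  be log(line: String) =>
    // Asynchronous behaviour: executed later, one message at a time, no locks.
    _env.out.print(line)

A call to Logger’s log returns to the sender at once; Counter’s bump is an ordinary call. A Smalltalk analogue would need syntax to mark the former as asynchronous.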

Which? "ref-caps that type the objects"? What does that mean?

 

This is the best page on ref-caps:  https://www.ponylang.io/learn/#reference-capabilities

 

The six ref caps define which objects can mutate which others at compile time.   Ref-caps provide the guarantees.  The ref caps connect the code of your language (Pony so far) to the MM runtime at compile time.
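A compressed, hypothetical sketch of what that compile-time check looks like; the six capabilities are iso, trn, ref, val, box, and tag, and the comments paraphrase the tutorial page linked above.

class Session
  var attempts: U8 = 0

actor Peer
  be greet(name: String val) =>
    // val: deeply immutable, so any number of actors may share it safely.
    None

  be adopt(s: Session iso) =>
    // iso: the only mutable alias, so mutating it here is race-free.
    s.attempts = 1

actor Main
  new create(env: Env) =>
    let local: Session ref = Session
    local.attempts = 2                  // ref: freely mutable inside this one actor
    let p = Peer
    p.greet("hello")
    // p.adopt(local)                   // rejected at compile time: a ref is not sendable
    p.adopt(recover iso Session end)    // an iso, given up by the sender, can be sent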

 

With the CapabilitiesLocal I pointed you to, we have an Actor model with async message passing to the behaviors of an Actor. Squeak has Actors supporting remote references (3-way introduction through the gift tables is broken; remote references from Alice to Bob are working). See the tests in ThunkHelloWorldTest: #testConnectAES, #testConnectAESBufferOrdering & #testConnectAESBuffered.

 

 

That exists for both Smalltalk capabilities and Pony capabilities. In fact, instead of talking about actors, concurrency & parallel applications, I prefer to speak of a capabilities model, inherently on an event loop, which is the focal point for safe concurrency.

 

I suggest a study of the Pony scheduler.  There are actors, mailboxes, message queues, and the scheduler, mainly.   You don’t need to be concerned about safety.  It’s been handled for you by the runtime and ref-caps.

Same with Raven, plus remote. We have all of that. See the PriorityVat class.

 

Last time I checked, Squeak/Pharo didn’t have actors that run without interruption by the GC.  Or can Smalltalk actors do that now?  How hard would a per-actor memory architecture be to implement on the OSVM?  You need a concurrent collector algorithm to make it work.  I don’t think you have that in the current VM.





 

I don’t have time for Pony programming these days; I can’t even read about it these days.  Go ahead if you wish.

 

Your time is better spent in other ways, though.

I communicate with you about what could be, but I agree I must stay focused on my primary target, which is porting SSL to use my new ThunkStack framework for remote encrypted communications. End-to-end encryption is what I am about. Here is a visualization of what I aim for, with TLS 1.3 and Signal as to-be-done projects; it is currently vaporware. I have ParrotTalk done and am working on SSL; then I will move to SSH. The script I listed above will load all remote packages, except for SSH. I am attaching the flyer I created to broadcast Squeak's ProCrypto configuration.



 

The speed and scale advantages of Orca over the big-heap approach have been demonstrated.  That was done some time ago.   Read the paper by Clebsch and friends for details.  

Regarding capabilities please read the ELib documentation on ERights website: http://erights.org/elib/index.html

 

The material is not well organized, and is very hard to even want to read.  Maybe you are the one to change that.

 

I’m looking for:  1) definitions of all terms that are new or nonstandard; 2) problem-constraint invariants (one can’t reason about anything without invariants); 3) problem-solution objectives expressed in measurable terms, like data-rate or latency. 

 

Can I get those?  This might be a good time and place to describe your proposed concurrency solution for Smalltalk in terse, measurable terms.  

 

Have you measured a running actor-based app on one node?  On two or more?





You state that Pony is just to access Orca. 

 

The two are codesigned.

 

What makes Orca so great?

 

The measurements.  Read the end of the Orca paper if you can’t read the whole thing.  Orca is the best concurrent MM protocol now.  

 

Aside:  I want to be rid of general purpose GC.  I think the purely functional approach with persistent data structures can work better (think Haskell).  You still need to maintain state in processing queues (think Clojure; not sure how Haskell handles this).  You still need temporal-coherency control devices at the periphery for IO (atoms for example).      

 

 

There is definitely more than one Actor per Vat.

 

Why have another construct?  It appears to be scoped to the node.  Is that the reason for the vat’s existence?  Does it control machine-node-specific messaging and resource management for the actors it contains?





 

1 per process. There will be a finite number of Capability actors in each event loop. 

But more than one, scaling into the thousands per Vat.

 

Is there a formal definition of vat?  I found this:

 

“A vat is the part of the Neocosm implementation that has a unique network identity. We expect that normal circumstances, there will only be one vat running on a particular machine at one time. Neocom currently (28 May 1998) supports only one avatar per vat.”

 

 

Here is the state machine specification for ParrotTalk version 3.7, which is compiled by the ProtocolStateCompiler. This stateMap models states, triggers, transitions, defaults, and callbacks, and is simple to use.

ParrotTalkSessionOperations_V3_7 class>>#stateMap

    "(((ParrotTalkSessionOperations_v3_7 stateMap compile)))"

    | desc |
    desc := ProtocolStateCompiler initialState: #initial.
    (desc newState: #initial -> (#processInvalidRequest: -> #dead))
        add: #answer -> (nil -> #receivingExpectHello);
        add: #call -> (nil -> #receivingExpectResponse).
    (desc newState: #connected -> (#processInvalidRequest: -> #dead))
        addInteger: 7 -> (#processBytes: -> #connected).
    (desc newState: #dead -> (#processInvalidRequest: -> #dead)).

    (desc newState: #receivingExpectHello -> (#processInvalidRequest: -> #dead))
        addInteger: 16 -> (#processHello: -> #receivingExpectSignature).
    (desc newState: #receivingExpectSignature -> (#processInvalidRequest: -> #dead))
        addInteger: 18 -> (#processSignature: -> #connected);
        addInteger: 14 -> (#processDuplicateConnection: -> #dead);
        addInteger: 15 -> (#processNotMe: -> #dead).

    (desc newState: #receivingExpectResponse -> (#processInvalidRequest: -> #dead))
        addInteger: 17 -> (#processResponse: -> #connected);
        addInteger: 14 -> (#processDuplicateConnection: -> #dead);
        addInteger: 15 -> (#processNotMe: -> #dead).
    ^desc.

 

 

Because it will continue the current Smalltalk-concurrency lameness.

 

The only identified difference is not the Actor model; it is the near-real-time requirement on the garbage collector, yes? So what lameness do you reference?

 

The actor model is a problem too if you stop all actors with a StW GC.  Fix the problem by letting each actor run until it finishes processing the last message.  Then let it collect its own garbage.  Then let it take the next message from the queue.  All actors are single-threaded by definition.  This maximizes processing rate and spreads the GC disruption of domain work smoothly.  It also increases tracing overhead transiently when large numbers of mutable objects are used (21% peak CPU consumption ascribable to tracing when you throw the tree/ring exercise at Pony with as many mutable types as possible).  We will be turning to functional programming strategies for at least some (if not eventually all) core (not peripheral IO) parallelization efforts, but I digress somewhat.

 

Two big problems impede parallelization of programs in Smalltalk:  1) the GC stops all actors, all of them or large chunks of them at once, depending on how the GC works.  Neither situation is acceptable.  That this pause is small is not as important as the work lost from all the actors during that period;  2) core Smalltalk doesn’t have a guaranteed concurrency-integrity model that automatically starts threads on different cores; it can only interleave them on one core (green threading).   

 

These ideas should not be add-ons or frameworks.  They should be core features.  If we can do the above two in Squeak/Pharo, I’ll use Squeak/Pharo to work on a VM. 

 

Have you tried implementing SmallInteger class>>#tinyBenchmarks in Pony?

 

No, sadly.  I’ve yet to build Pony.  I’m still reading about the techniques and background.  I plan to parse large CSV files into objects, as in Rocco’s exercise.  I have some files I can use for that purpose, and can get the same data over HTTP to test a connection too.  That would be a nice first experiment.  We need to be clinical about this.  All the talk and hand-waving is an okay start, but at some point, we must measure and keep on doing that in a loop, as architecture is tweaked.  I would like to compare a Pony parsing program to my VW parsing program as a start.  But the Pony work has to wait.
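For reference, the sends-counting half of tinyBenchmarks (SmallInteger>>#benchFib, which answers the number of activations) would translate roughly like this in Pony; a hypothetical, untested sketch, not something from the thread.

actor Main
  new create(env: Env) =>
    // benchFib answers the number of activations, as in the Smalltalk original.
    env.out.print("benchFib(30) = " + bench_fib(30).string())

  fun bench_fib(n: U64): U64 =>
    if n < 2 then
      1
    else
      bench_fib(n - 1) + bench_fib(n - 2) + 1
    end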





Too slow/wasteful.   Moving an actor isn’t needed if each has its own heap.

In ELib, why not allow an Actor to be mobile and move from Alice's Vat to Bob's Vat?

 

Are Vats for scoping actors to specific machine nodes?  If so, then yes, move the actors to another machine node if it is better suited to what the actor does and needs.

 

Then automated management apps can truly rebalance Actors, but only on rare occasions, not for every call.

 

Yes, we must have run-time feedback to adjust resources, which are the actors and the machine nodes they are made to run on.  A coordination actor would watch all other actors and clusters of them.


 

Once again, the near-real-time claim for the Orca garbage collector: I am not convinced, but I am reading some papers.



Runtime action with an Orca collector is definitely smoother.   This we can know before the fact (before we measure) from the fine-grain MM.  What is not clear without testing is overall speed.  We need to test and measure.

 

 

86 ms will not break the contract, I propose.

 

It’s okay for a business app, certainly, but not so much for machine control or 3D graphics.  It’s good enough for prototyping a 3D sim, but not for deployment.





 

Yes, and a dedicated cycle-detecting actor watches for this in Pony-Orca. 

I don't watch for it, but it is a strong design consideration. Keep Actor behaviors short and sweet.

 

The cycles are about chaining of messages/dependencies between actors.  When a programmer makes these actor clusters manually, messing up is easy.  A higher-level tool is needed to manage these constructions.  The actor cycles will exist and become problematic even if each actor is fast.  The detector finds and exposes them to the programmer.





 

ERights-style event loops ensure no data races, as long as actor objects are not accessible from more than one event loop.

 

Speed hit.

???



This constraint is not needed to guarantee correct concurrency in Pony, where the guarantee happens at compile time and takes no CPU cycles.   The approach above sounds dynamic; there are some CPU cycles involved.

 

Yes; I didn’t claim otherwise.  The networked version is coming.  See above.   My point is that the ‘remote’ characterization is not needed.  It’s not helping us describe and understand. 

It does so for me: either we have intra-Vat message calling, which is immediate, adding to the stack, or we have inter-Vat message sending, which is asynchronous, adding to the queue.



I still don’t see a clear definition of vat.

 

None of the above language is needed when the concurrency scheme is simpler and doesn’t use those ideas and devices.  

Design-wise, it makes much sense to treat inter-thread, inter-process and inter-machine concurrency as the same remote interface.

 

The design is already done, modulo the not-yet-present network extension.  Interfacing between actors is always by async messaging.  Messaging will work as transparently as possible in the networked version across machine nodes. 

We must remember that we are still vulnerable to network-failure errors.



Yes, they are keenly aware of that.  It’s a big task and won’t happen for a while.  But it will happen.

  

 

I don’t like the Pony language syntax.  I don’t like anything that looks like Algol 60.  Pony is a language, compiler and runtime implementing Orca.  The other stuff is good.  And I’ve not had much time to use it; I suspect I could like it more.

No argument from me! Squeak is a language, a compiler and a profound image-based runtime. Is there another language with such an image-based runtime? I think not.

Yes, we all love Smalltalk.  It’s still too slow. 

We’re not talking about coding ergonomics and testing dynamics.  We all agree that Smalltalk is a better way to code and interact with the evolving program.  But that’s not what this thread is about.  This is all about speed and scale across cores, CPUs, and machine nodes.  The solution must be implemented close to the metal (in the VM).  It can’t be an add-on framework.  We need an Actor class and syntax for making actors and their async behaviours.  Then we need the VM to understand the new format and bytecodes associated with actors.

 

Shaping

 
