[Vm-dev] An event driven Squeak VM

Wed Nov 11 19:57:57 UTC 2009

2009/11/11 Eliot Miranda <eliot.miranda at gmail.com>:
>
>
>
> On Wed, Nov 11, 2009 at 11:16 AM, Igor Stasenko <siguctua at gmail.com> wrote:
>>
>> 2009/11/11 Eliot Miranda <eliot.miranda at gmail.com>:
>> >
>> >
>> >
>> > On Wed, Nov 11, 2009 at 10:20 AM, Igor Stasenko <siguctua at gmail.com> wrote:
>> >>
>> >> 2009/11/11 Eliot Miranda <eliot.miranda at gmail.com>:
>> >> >
>> >> >
>> >> >
>> >> > On Tue, Nov 10, 2009 at 9:59 PM, Igor Stasenko <siguctua at gmail.com> wrote:
>> >> >>
>> >> >> 2009/11/11 Eliot Miranda <eliot.miranda at gmail.com>:
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> > On Tue, Nov 10, 2009 at 6:45 PM, John M McIntosh <johnmci at smalltalkconsulting.com> wrote:
>> >> >> >>
>> >> >> >> On 2009-11-10, at 6:17 PM, Eliot Miranda wrote:
>> >> >> >>
>> >> >> >>> With the threaded Squeak VM I'm working on one can go one better and have a number of image-level processes that block in the FFI and a number of worker threads in the VM that block on OS semaphores waiting for the VM to give them something to do.
>> >> >> >>
>> >> >> >> Obviously now you have to give a bit more details on this. Is it like the hydra VM? Or entirely different?
>> >> >> >
>> >> >> > Orthogonal, in that it might work well with Hydra.  The basic scheme is to have a natively multi-threaded VM that is not concurrent.  Multiple native threads share the VM such that there is only one thread running VM code at any one time.  Thus the VM can make non-blocking calls to the outside world but neither the VM nor the image need to be modified to handle true concurrency.  This is the same basic architecture as in the Strongtalk and V8 VMs and notably in David Simmons' various Smalltalk VMs.
>> >> >> > The cool thing about the system is David's design.  He's been extremely generous in explaining to me his scheme, which is extremely efficient.  I've merely implemented this scheme in the context of the Cog VM.  The idea is to arrange that a threaded callout is so cheap that any and all callouts can be threaded.  This is done by arranging that a callout does not switch to another thread, instead the thread merely "disowns" the VM.  It is the job of a background heartbeat thread to detect that a callout is long-running and that the VM has effectively blocked.  The heartbeat then activates a new thread to run the VM and the new thread attempts to take ownership and will run Smalltalk code if it succeeds.
>> >> >> > On return from a callout a thread must attempt to take ownership of the VM, and if it fails, add itself to a queue of threads waiting to take back the VM and then wait on an OS semaphore until the thread owning the VM decides to give up ownership to it.
>> >> >> > Every VM thread has a unique index.  The vmOwner variable holds the index of the owning thread or 0 if the VM is unowned.  To disown the VM all a thread has to do is zero vmOwner, while remembering the value of vmOwner in a temporary.  To take ownership a thread must use a low-level lock to gain exclusive access to vmOwner, and if vmOwner is zero, set it back to the thread's index, and release the lock.  If it finds vmOwner is non-zero it releases the lock and enters the wanting ownership queue.
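The ownership rules described above can be modelled in a few lines. This is only a toy sketch in Python, not the Cog VM code: the class name `VMOwnership`, its method names and the use of `threading.Lock` as the "low-level lock" are all assumptions made for illustration. Handing the VM over to a queued thread and waiting on an OS semaphore are left out.

```python
import threading
from collections import deque


class VMOwnership:
    """Toy model of the vmOwner protocol (hypothetical, for illustration)."""

    def __init__(self):
        self.vm_owner = 0              # 0 means the VM is unowned
        self._lock = threading.Lock()  # stands in for the low-level lock
        self.waiting = deque()         # indices of threads wanting the VM

    def disown(self):
        """Give up the VM before a callout; return the index to remember."""
        previous = self.vm_owner
        self.vm_owner = 0
        return previous

    def try_own(self, thread_index):
        """Try to take the VM; on failure enter the waiting queue."""
        assert thread_index > 0, "index 0 is reserved for 'unowned'"
        with self._lock:
            if self.vm_owner == 0:
                self.vm_owner = thread_index
                return True
        # vmOwner was non-zero: the lock is released, now queue up.
        self.waiting.append(thread_index)
        return False
```

Note that disowning touches no lock at all, which is what keeps the callout path cheap in this model; only taking ownership pays for synchronization.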
>> >> >> > In the Cog VM the heartbeat beats at 1KHz, so any call that takes less than 0.5ms is likely to complete without the heartbeat detecting that the VM is blocked.  So any and all callouts can be threaded.  Quite brilliant.  All the work of changing the active process when switching between threads is deferred from callout time to when a different thread takes ownership of the VM, saving the VM state for the process that surrendered the VM and installing its own.
>> >> >> > The major wrinkle in this is that in David's VM he has a pinning garbage collector which arranges that any arguments passed out through the FFI are implicitly pinned.  We don't yet have a pinning garbage collector.  I do plan to do one.  But in the interim one quick hack, a neat idea of Andreas', is to fail calls that attempt to pass objects in new space, allowing only old objects to be passed, and to prevent the full garbage collector from running while any threaded calls are in progress.
>> >> >> > Having cheap non-blocking calls allows e.g.
>> >> >> > - the Hydra inter-VM channels to be implemented in Smalltalk code above the threaded FFI
>> >> >> > - socket calls to be blocking calls in the image
>> >> >> > - Smalltalk code to call select/poll/WaitForMultipleEvents
>> >> >> > There are still plenty of sticky issues to do with e.g. identifying threads that can do specific functions, such as the UI thread, and issuing OpenGL calls from the right thread, etc, etc.  But these are all doable, if potentially tricky to get right.  If this kind of code does migrate from the VM innards up to the image I think that's a really good thing (tm) but one will really have to know what one is doing to get it right.
>> >> >> > HTH
>> >> >> > eliot
>> >> >>
>> >> >> I used a mutex in Hydra (each interpreter has its own mutex), so any
>> >> >> operation which requires synchronization should be performed
>> >> >> only after obtaining ownership of the mutex.
>> >> >> And sure, if crafted carefully, one could release the mutex before
>> >> >> doing an external call, and "try" to get it back again after the call
>> >> >> has completed.
>> >> >> If you use mutexes provided by the OS, then you don't need a heartbeat
>> >> >> process, obviously, because you can simply wait on the mutex. So I
>> >> >> suppose you are introducing the heartbeat to minimize the overhead of
>> >> >> using synchronization primitives provided by the OS, using
>> >> >> low-level assembly code instead.
>> >> >>
>> >> >> Just one minor thing - you mentioned the table of threads. What if
>> >> >> some routine creates a new thread which goes unnoticed by the VM, so it's
>> >> >> not registered in the VM 'threads' table, but then that thread
>> >> >> somehow attempts to obtain ownership of the interpreter?
>> >> >
>> >> > This can only happen on a callback or other well-defined entry-point.  At these well-defined entry-points the VM checks whether there is a tag in thread-local storage (the thread's VM index).  If it is not set the VM allocates the necessary per-thread storage, assigns an index and allows the thread to continue.  On return from the entry-point the VM deallocates the storage, clears the thread-local storage and returns.
>> >> >
>> >>
>> >> Yes. Just to make sure everything is ok with that :)
>> >>
>> >> >>
>> >> >> About inter-image communication in Hydra. The main problem is that you
>> >> >> need to pass a buffer between heads, so you need to get a lock on the
>> >> >> recipient while still keeping a lock on the sender interpreter. But this
>> >> >> could lead to deadlock if the recipient in its own turn attempts to do the
>> >> >> same.
>> >> >> So the solution, unfortunately, is to copy the buffer to the C heap (using
>> >> >> malloc().. yeah :( ), and pass an event with a pointer to that buffer,
>> >> >> which can then be handled by the recipient as soon as it is ready to do so,
>> >> >> in its event handling routine.
>> >> >
>> >> > But you could connect the two with a pair of pipes, right?  Then all that locking and buffer allocation is in the VM.  Or rather, once you have a non-blocking FFI you can just use an OS's native stream-based inter-process communications facilities.
>> >> >
>> >>
>> >> of course i could. but the task is to minimize the overhead, possibly
>> >> even without buffer copy overhead (that's where a pinning GC would be
>> >> really helpful). i don't think that OS facilities avoid copying the data
>> >> buffer to a secure location before passing it between the sockets.
>> >> Because once the OS releases the sender while still waiting for the receiver
>> >> to be ready to retrieve the data, it can't guarantee that the given buffer
>> >> will not be used for something else, hence it inevitably must either
>> >> copy the buffer contents to a secure location or block the sender.
>> >
>> > OK, instead one can create a buffer from within Smalltalk (e.g. via Alien) and then create OS semaphores and use blocking calls to wait on those semaphores.  All I'm really trying to say is that once you have a threaded FFI you can move lots of stuff up out of the VM.  The disadvantage is that one loses platform independence, but I've long thought that Smalltalk class hierarchies are a much nicer way of creating cross-platform abstractions than great gobs of platform-specific C code.
>> > Andreas counters that implementing the abstractions in the VM keeps them well-defined and free from meddling.  But that runs counter to the philosophy of an open system and preventing inadvertent meddling is something Smalltalk has to do anyway  (e.g. "Process should not be redefined, proceed to store over it").  The nice things about shooting oneself in the foot by meddling with a Smalltalk system are that a) it doesn't really do any harm and b) the debugging of it can be a great learning experience.
>> >
>> >> >>
>> >> >> One more thing:
>> >> >>  socket calls to be blocking calls in the image
>> >> >>
>> >> >> Assuming that the VM uses blocking sockets, the call will block the thread
>> >> >> & one of the image-side processes.
>> >> >> Then the heartbeat thread at some point sees that the VM has no owning thread
>> >> >> and so allows another thread, waiting in the queue, to take ownership
>> >> >> of the VM.
>> >> >> But what if there is no such thread? There is a choice: allocate a new
>> >> >> native thread and let it continue running the VM, or just ignore it & skip
>> >> >> over to the next heartbeat.
>> >> >> I'd like to hear what you chose. Because depending on the direction
>> >> >> taken, on a server image which simultaneously serves, say, 100
>> >> >> connections you may end up either with 100 + 1 native threads, or with a
>> >> >> smaller (fixed) number of them but with the risk of being unable to run any
>> >> >> VM code until some of the blocking calls complete.
>> >> >
>> >> > There is a simple policy that is a cap on the total number of threads the VM will allocate.  Below this a new thread is allocated.  At the limit the VM will block.  But note that the pool starts at 1 and only grows as necessary up to the cap.
>> >> >>
>> >> >> I'd like to note that either of the above alternatives has quite bad
>> >> >> scalability potential.
>> >> >> I'd prefer to have a pool of threads, each of them serving N
>> >> >> connections. The size of the thread pool should be 2x-3x the number of
>> >> >> processor cores on the host, because making it larger than that will not
>> >> >> make any real difference, since a single core can run only a single native
>> >> >> thread at a time while the others will just consume memory resources, like
>> >> >> address space etc.
>> >> >
>> >> > That's very similar to my numbers too.  My current default is at least two threads and no more than 32, and 2 x num processors/cores in between.  But these numbers should be configurable.  This is just to get started.
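The default sizing rule quoted above (at least two threads, no more than 32, twice the core count in between) amounts to a clamp. A minimal sketch, assuming a hypothetical helper name `default_thread_cap` and using Python's `os.cpu_count()` to stand in for however the VM queries the number of cores:

```python
import os


def default_thread_cap(num_cores=None, minimum=2, maximum=32):
    """Clamp 2 x cores into [minimum, maximum]; all values configurable."""
    if num_cores is None:
        # os.cpu_count() may return None when the count is unknown.
        num_cores = os.cpu_count() or 1
    return max(minimum, min(maximum, 2 * num_cores))
```

For example, a 4-core host gets a cap of 8, while anything with 16 or more cores is held at 32.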
>> >>
>> >> Yes, but blocking sockets won't allow you to distribute load evenly
>> >> when the number of threads is less than the number of active sockets. All
>> >> active connections should be distributed evenly among worker threads; that
>> >> will guarantee that you consume computing resources optimally.
>> >
>> > So in that case one needs more threads, and to support them one needs more memory.  But it shouldn't be a surprise that one needs more resources to support a higher workload. Yer takes yer choice and yer pays yer price.  And of course one can provide settings to control the per-thread stack space etc.
>> >
>> >> And what about scheduling? Have you considered my idea to move
>> >> scheduling to the language side, while leaving on the VM side only a
>> >> very small portion (in amount of code & logic) for switching the
>> >> active processes?
>> >> I think that with the introduction of the JIT the overhead of language-side
>> >> scheduling will be quite small and quite acceptable, given that it
>> >> allows us to change things whenever we want without touching the VM.
>> >
>> > No I haven't considered this because at least in my implementation, the scheduler and the thread manager are intimately connected.  Once a process is in a callout on a particular thread it is bound to that thread and if it calls back it'll run on that thread and only on that thread until the callout unwinds.  Further, one will be able to bind processes to specific threads to ensure that certain activities happen from a given native thread (e.g. OpenGL and the Windows debugging API both require that calls are issued from a single thread).  So the thread manager and scheduler cooperate to bind processes to threads and to arrange that the right thread switches occur when process switches require them.  Figuring out how to do that with a Smalltalk-level scheduler is more than I can manage right now :)
>>
>> Hmm.. i'm not sure that binding a process to a thread will guarantee
>> that all callouts will be made from the same thread.
>> Consider this code (presumably using some external API which should be
>> used only from the main thread), which runs within the callback:
>>
>> [ gl doSomething ] fork
>>
>> Suppose that the active process is bound to a particular native thread, but
>> the forked process is not. The problem is that you must also
>> bind the forked process to run on the same thread as the process which
>> created it, otherwise you have virtually no guarantee that some call
>> will not be made from the wrong thread.
>> But you can't predict what the forked process does; it may or may not make
>> wrong calls, so binding the forked process to the same thread is a very
>> pessimistic choice.
>>
>> Instead, why not expose the new VM abilities to the language, so one could
>> specify that a given callout should use a given thread. Something
>> like:
>>
>> threadHandle := Smalltalk currentThreadId.
>>
>> externalFunction call: { arguments } inThread: threadHandle
>>
>> while by default
>>
>> externalFunction call: { arguments }
>>
>> will be free to run in any thread which is controlled by the VM.
>
> Because that'll deadlock if the thread is doing something else.  Dedicating a thread to a process can avoid that.  e.g. spawn a process and bind it to a thread.  wrap the process in an API object.  The external API methods pass in requests via blocks to the hidden server process.  AFAICT Croquet uses this kind of approach to send messages between processes.
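The "API object in front of a dedicated server process" idea can be sketched as follows. This is an illustrative Python model only, not Cog or Croquet code: the class `BoundThreadAPI` and its methods are hypothetical, a Python thread plays the role of the process bound to a native thread, and callables play the role of the blocks passed in as requests.

```python
import queue
import threading


class BoundThreadAPI:
    """Runs every request on one dedicated thread (hypothetical sketch)."""

    def __init__(self):
        self._requests = queue.Queue()
        self._thread = threading.Thread(target=self._serve, daemon=True)
        self._thread.start()

    def _serve(self):
        # The hidden server loop: take a request, run it, send back the result.
        while True:
            func, reply = self._requests.get()
            if func is None:
                break
            try:
                reply.put((True, func()))
            except Exception as exc:
                reply.put((False, exc))

    def call(self, func):
        """Queue func for the bound thread and wait for its result."""
        reply = queue.Queue(maxsize=1)
        self._requests.put((func, reply))
        ok, value = reply.get()
        if not ok:
            raise value
        return value

    def close(self):
        """Ask the server loop to stop and wait for the thread to finish."""
        self._requests.put((None, None))
        self._thread.join()
```

Because requests are queued rather than forced onto a thread that may be busy, a caller only waits for its turn; the dedicated thread is never asked to do two things at once.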

well i think that it's easy to work out a solution for telling a
callout to be performed in a specific thread.
And i don't think that having such control at the language level
will make any difference, because you still have to deal with it
anyway, even with the model which you currently employ.

>>
>> Then you are freed from implementing complex and (to me) quite fragile
>> logic on the VM side which magically attempts to keep all the horses
>> well fed :)
>
> The mechanisms I've implemented don't look that fragile to me.  But it's too early to report any real experience.  You could be right, but what I've done so far makes sense to me :)

I am right ;) , because from the interpreter's point of view, switching to a
different Process means just switching the active context, which is
quite a regular procedure for the VM - the interpreter switches the active
context all the time, at each message send and return!
So, ask yourself, why would the VM want to know anything beyond the
context it should continue interpreting from?

Potentially the cost of switching the active process == the cost of a message
send. But of course, all the logic needed to pick which context the
interpreter should switch to may introduce some overhead.
But i prefer to have such logic implemented on the language side, instead
of being frozen inside the VM.


-- 
Best regards,
Igor Stasenko AKA sig.