[Vm-dev] An event driven Squeak VM

Wed Nov 11 18:20:09 UTC 2009

2009/11/11 Eliot Miranda <eliot.miranda at gmail.com>:
>
>
>
> On Tue, Nov 10, 2009 at 9:59 PM, Igor Stasenko <siguctua at gmail.com> wrote:
>>
>> 2009/11/11 Eliot Miranda <eliot.miranda at gmail.com>:
>> >
>> >
>> >
>> > On Tue, Nov 10, 2009 at 6:45 PM, John M McIntosh <johnmci at smalltalkconsulting.com> wrote:
>> >>
>> >> On 2009-11-10, at 6:17 PM, Eliot Miranda wrote:
>> >>
>> >>> With the threaded Squeak VM I'm working on one can go one better and have a number of image-level processes that block in the FFI and a number of worker threads in the VM that block on OS semaphores waiting for the VM to give them something to do.
>> >>
>> >> Obviously now you have to give a bit more details on this. Is it like the hydra VM? Or entirely different?
>> >
>> > Orthogonal, in that it might work well with Hydra.  The basic scheme is to have a natively multi-threaded VM that is not concurrent.  Multiple native threads share the VM such that there is only one thread running VM code at any one time.  Thus the VM can make non-blocking calls to the outside world but neither the VM nor the image need to be modified to handle true concurrency.  This is the same basic architecture as in the Strongtalk and V8 VMs and notably in David Simmons' various Smalltalk VMs.
>> > The cool thing about the system is David's design.  He's been extremely generous in explaining to me his scheme, which is extremely efficient.  I've merely implemented this scheme in the context of the Cog VM.  The idea is to arrange that a threaded callout is so cheap that any and all callouts can be threaded.  This is done by arranging that a callout does not switch to another thread; instead the thread merely "disowns" the VM.  It is the job of a background heartbeat thread to detect that a callout is long-running and that the VM has effectively blocked.  The heartbeat then activates a new thread to run the VM and the new thread attempts to take ownership and will run Smalltalk code if it succeeds.
>> > On return from a callout a thread must attempt to take ownership of the VM, and if it fails, add itself to a queue of threads waiting to take back the VM and then wait on an OS semaphore until the thread owning the VM decides to give up ownership to it.
>> > Every VM thread has a unique index.  The vmOwner variable holds the index of the owning thread or 0 if the VM is unowned.  To disown the VM all a thread has to do is zero vmOwner, while remembering the value of vmOwner in a temporary.  To take ownership a thread must use a low-level lock to gain exclusive access to vmOwner, and if vmOwner is zero, set it back to the thread's index, and release the lock.  If it finds vmOwner is non-zero it releases the lock and enters the wanting ownership queue.
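The ownership protocol just described can be sketched in C roughly as follows. This is only an illustration under stated assumptions: the function names `disown_vm` and `try_own_vm` are hypothetical, and a C11 compare-and-swap stands in for the "low-level lock" mentioned in the text. Only the `vmOwner` variable and the zero-means-unowned convention come from the description.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Index of the thread that owns the VM, or 0 if the VM is unowned.
   Corresponds to vmOwner in the text; all other names are hypothetical. */
static atomic_int vm_owner = 0;

/* Disown the VM before a callout: zero the owner and hand back the index
   so the caller can remember it. No thread switch happens here. */
static int disown_vm(int my_index)
{
    assert(atomic_load(&vm_owner) == my_index); /* only the owner may disown */
    atomic_store(&vm_owner, 0);
    return my_index;
}

/* Try to take ownership: succeeds only if the VM is currently unowned.
   The compare-and-swap plays the role of "lock, test for zero, set, unlock". */
static bool try_own_vm(int my_index)
{
    int expected = 0;
    return atomic_compare_exchange_strong(&vm_owner, &expected, my_index);
}
```

When `try_own_vm` returns false, the thread would add itself to the queue of threads wanting ownership and wait on an OS semaphore; that part is left out of the sketch.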
>> > In the Cog VM the heartbeat beats at 1KHz, so any call that takes less than 0.5ms is likely to complete without the heartbeat detecting that the VM is blocked.  So any and all callouts can be threaded.  Quite brilliant.  All the work of changing the active process when switching between threads is deferred from callout time to when a different thread takes ownership of the VM, saving the VM state for the process that surrendered the VM and installing its own.
>> > The major wrinkle in this is that in David's VM he has a pinning garbage collector which arranges that any arguments passed out through the FFI are implicitly pinned.  We don't yet have a pinning garbage collector.  I do plan to do one.  But in the interim one quick hack, a neat idea of Andreas', is to fail calls that attempt to pass objects in new space, allowing only old objects to be passed, and to prevent the full garbage collector from running while any threaded calls are in progress.
>> > Having cheap non-blocking calls allows e.g.
>> > - the Hydra inter-VM channels to be implemented in Smalltalk code above the threaded FFI
>> > - socket calls to be blocking calls in the image
>> > - Smalltalk code to call select/poll/WaitForMultipleEvents
>> > There are still plenty of sticky issues to do with e.g. identifying threads that can do specific functions, such as the UI thread, and issuing OpenGL calls from the right thread, etc, etc.  But these are all doable, if potentially tricky to get right.  If this kind of code does migrate from the VM innards up to the image I think that's a really good thing (tm) but one will really have to know what one is doing to get it right.
>> > HTH
>> > eliot
>>
>> I used a mutex in Hydra (each interpreter has its own mutex), so any
>> operation that requires synchronization should be performed
>> only after obtaining ownership of the mutex.
>> And sure, if crafted carefully, one could release the mutex before
>> doing an external call, and "try" to get it back again after the call
>> completes.
>> If you use mutexes provided by the OS, then you don't need a heartbeat
>> process, obviously, because you can simply wait on the mutex. So I
>> suppose you are introducing the heartbeat to minimize the overhead of
>> using synchronization primitives provided by the OS, using
>> low-level assembly code instead.
>>
>> Just one minor thing - you mentioned the table of threads. What if
>> some routine creates a new thread that goes unnoticed by the VM, so it's
>> not registered in the VM 'threads' table, but then that thread
>> attempts to obtain ownership of the interpreter somehow?
>
> This can only happen on a callback or other well-defined entry-point.  At these well-defined entry-points the VM checks whether there is a tag in thread-local storage (the thread's VM index).  If it is not set the VM allocates the necessary per-thread storage, assigns an index and allows the thread to continue.  On return from the entry-point the VM deallocates the storage, clears the thread-local storage and returns.
>
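The entry-point check described above might look roughly like this in C. It is a sketch under assumptions: `enter_vm`, `leave_vm`, `slot_in_use` and `MAX_VM_THREADS` are hypothetical names, the per-thread storage is reduced to a slot flag, and the slot table is assumed to be protected by whatever lock the VM already holds at its entry points.

```c
#include <assert.h>

#define MAX_VM_THREADS 32  /* hypothetical cap on registered threads */

/* The thread's VM index, kept in thread-local storage; 0 means "unknown to the VM". */
static _Thread_local int tls_vm_index = 0;

/* Which indices are taken (index 0 is reserved for "no thread").
   Assumed to be accessed only while the VM's own lock is held. */
static int slot_in_use[MAX_VM_THREADS + 1];

/* Called at a callback or other well-defined entry point.
   Returns the thread's index, or 0 if no slot is free.
   *was_new is set to 1 when the index was allocated by this call. */
static int enter_vm(int *was_new)
{
    *was_new = 0;
    if (tls_vm_index == 0) {
        for (int i = 1; i <= MAX_VM_THREADS; i++) {
            if (!slot_in_use[i]) {
                slot_in_use[i] = 1;   /* stands in for allocating per-thread storage */
                tls_vm_index = i;
                *was_new = 1;
                break;
            }
        }
    }
    return tls_vm_index;
}

/* Called on return from the entry point; undoes the registration
   only if enter_vm created it. */
static void leave_vm(int was_new)
{
    assert(tls_vm_index != 0);
    if (was_new) {
        slot_in_use[tls_vm_index] = 0;
        tls_vm_index = 0;
    }
}
```

A thread the VM already knows about keeps its index across the call; only a previously unknown thread is registered on entry and unregistered on return.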

Yes. Just to make sure everything is ok with that :)

>>
>> About inter-image communication in Hydra. The main problem is that you
>> need to pass a buffer between heads, so you need to get a lock on the
>> recipient while still keeping a lock on the sender interpreter. But this
>> could lead to deadlock if the recipient in its own turn attempts to do the
>> same.
>> So the solution, unfortunately, is to copy the buffer to the C heap (using
>> malloc().. yeah :( ), and pass an event with a pointer to that buffer,
>> which can then be handled by the recipient as soon as it is ready to do so,
>> in its event handling routine.
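A minimal sketch of the copy-to-C-heap approach, assuming a hypothetical `buffer_event` record and helper names (none of these are Hydra's actual API). The sender copies while holding only its own lock; the recipient consumes the event later under its own lock, so neither side ever holds both.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical event record: a pointer to a malloc'ed copy plus its size. */
typedef struct {
    void  *data;
    size_t size;
} buffer_event;

/* Sender side: copy the buffer out of the object heap into the C heap.
   Returns 0 on success, -1 if malloc fails. */
static int make_buffer_event(const void *src, size_t size, buffer_event *ev)
{
    void *copy = malloc(size ? size : 1);
    if (copy == NULL)
        return -1;
    memcpy(copy, src, size);
    ev->data = copy;
    ev->size = size;
    return 0;
}

/* Recipient side: copy the payload into its own storage (dst must have
   room for ev->size bytes), then release the C-heap copy. */
static void consume_buffer_event(buffer_event *ev, void *dst)
{
    assert(ev->data != NULL);
    memcpy(dst, ev->data, ev->size);
    free(ev->data);
    ev->data = NULL;
    ev->size = 0;
}
```

The cost is two copies per message, which is the overhead a pinning garbage collector could help avoid.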
>
> But you could connect the two with a pair of pipes, right?  Then all that locking and buffer allocation is in the VM.  Or rather, once you have a non-blocking FFI you can just use an OS's native stream-based inter-process communications facilities.
>

Of course I could. But the task is to minimize the overhead, possibly
even without buffer copy overhead (that's where a pinning GC would be
really helpful). I don't think OS facilities avoid copying the data
buffer to a secure location before passing it between the sockets.
Because once the OS releases the sender, while still waiting for the
receiver to be ready to retrieve the data, it can't guarantee that the
given buffer will not be used for something else; hence it inevitably
must either copy the buffer contents to a secure location or block the
sender.

>>
>> One more thing:
>>  socket calls to be blocking calls in the image
>>
>> Assuming that the VM uses blocking sockets, the call will block the thread
>> & some image-side process.
>> Then the heartbeat thread at some point sees that the VM has no owning thread
>> and so allows another thread, waiting in the queue, to take ownership
>> of the VM.
>> But what if there is no such thread? There is a choice: allocate a new
>> native thread and let it continue running the VM, or just ignore it & skip
>> over to the next heartbeat.
>> I'd like to hear what you chose, because depending on the direction
>> taken, on a server image which simultaneously serves, say, 100
>> connections you may end up either with 100 + 1 native threads, or a smaller
>> (fixed) number of them but with the risk of being unable to run any VM code
>> until some of the blocking calls complete.
>
>  There is a simple policy that is a cap on the total number of threads the VM will allocate.  Below this a new thread is allocated.  At the limit the VM will block.  But note that the pool starts at 1 and only grows as necessary up to the cap.
>>
>> I'd like to note that both of the above alternatives have quite poor
>> scalability potential.
>> I'd prefer to have a pool of threads, each of them serving N
>> connections. The size of the thread pool should be 2x-3x the number of
>> processor cores on the host, because making more than that will not make
>> any real difference, since a single core can run only a single native
>> thread at a time while the others will just consume memory resources, like
>> address space etc.
>
> That's very similar to my numbers too.  My current default is at least two threads and no more than 32, and 2 x num processors/cores in between.  But these numbers should be configurable.  This is just to get started.
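The default sizing rule quoted above (at least two threads, no more than 32, and twice the core count in between) amounts to a clamp. A sketch, with a hypothetical function name:

```c
#include <assert.h>

/* Default cap on VM threads: 2 x cores, clamped to the range [2, 32].
   The function name is hypothetical; the numbers are the ones quoted above. */
static int default_max_threads(int num_cores)
{
    assert(num_cores >= 0);
    int n = 2 * num_cores;
    if (n < 2)
        n = 2;
    if (n > 32)
        n = 32;
    return n;
}
```

For example, a 4-core host gets a cap of 8 threads, and anything with 16 or more cores gets the maximum of 32.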

Yes, but blocking sockets won't allow you to distribute load evenly
when the number of threads is less than the number of active sockets.
All active connections should be distributed evenly among worker threads;
that will guarantee that you are consuming computing resources optimally.

And what about scheduling? Have you considered my idea to move
scheduling to the language side, while leaving on the VM side only a
very small portion (in amount of code & logic) for switching the
active processes?
I think that with the introduction of the JIT the overhead of language-side
scheduling will be quite small and quite acceptable, given that it
allows us to change things whenever we want, without touching the VM.


-- 
Best regards,
Igor Stasenko AKA sig.