[Vm-dev] Ideas on cheap multi-threading for Squeak / Pharo ? (from Tim's article)

Clément Bera bera.clement at gmail.com
Tue Jan 31 08:51:12 UTC 2017

Hi all,

Ronie, you said:

*Threads are more useful when one needs high performance and low latency in
an application that runs in a single computer. High performance video games
and (soft) realtime graphics are usually in this domain.*

I know you're working on high-performance video games. If you were to
introduce multi-threading in Squeak/Pharo, how would you do it?
In particular, do you have a design in mind that does not require
rewriting all the core libraries?

To sum-up previous mails:

1) There's the idea of having multiple images communicating with each
other, each image on a different VM, potentially with one native thread
per image. I think work is ongoing in this direction through multiple
frameworks. With a minimal image and a minimal VM, the cost of the
image+VM pair remains quite cheap; already today <15Mb for the pair is
possible. I believe this idea is great but does not entirely solve the problem.

2) Levente's idea is basically to share objects between images: the
shared objects are read-only and lazily duplicated to worker images upon
mutation, keeping the memory footprint of images on the same VM low. I
like the idea. Instead of duplication, I was thinking of stopping the
threads in order to mutate shared objects, giving the programmer the
responsibility to define a set of shared objects that are mutated
infrequently, and going later in the direction of shared writable memory.
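A toy version of the lazy-duplication-on-mutation scheme might look like this (Python for illustration; `CowRef` is a made-up name, and a real implementation would live in the VM's write barrier, not in library code):

```python
# Copy-on-write sketch: workers read the shared object directly, and the
# first attempted mutation triggers a private deep copy, so the shared
# original is never written.
import copy

class CowRef:
    def __init__(self, shared):
        self._shared = shared     # read-only original, shared by all workers
        self._local = None        # lazily created private copy

    def read(self):
        return self._local if self._local is not None else self._shared

    def write(self, key, value):
        if self._local is None:
            # duplicate on first mutation only
            self._local = copy.deepcopy(self._shared)
        self._local[key] = value

shared = {"Transcript": "global"}
a, b = CowRef(shared), CowRef(shared)
a.write("Transcript", "worker-a")
assert a.read()["Transcript"] == "worker-a"   # a sees its private copy
assert b.read() is shared                     # b still reads the shared original
```

This keeps the footprint low exactly when the programmer keeps mutation of shared objects rare, which matches the trade-off described above.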

3) Ben's idea is to create a process in a new thread that cannot mutate
objects in memory. I have issues with this design because each worker
thread, as you say, has to work only with the stack; hence it cannot
allocate objects, and hence it cannot use closures.

4) I need to look into the RoarVM project again, and Dave Ungar's work on
multithreaded Smalltalk. I should probably contact Stefan Marr again.

5) I didn't mention it earlier, but there's Eliot's work on ThreadedFFI,
which uses multiple native threads for FFI calls. It also solves part of
the multi-threading problem.

Thanks for sharing ideas.

On Tue, Jan 31, 2017 at 4:25 AM, Ronie Salgado <roniesalg at gmail.com> wrote:

> Hi all,
>> It’s 35+ years ago but my last experience in very parallel systems left
>> me convinced that the first thing you do is prioritise the inter-process
>> communication and leave the ‘real work’ as something to do in the machine’s
>> spare time. I had a Meiko Transputer Computing Surface when I was an IBM
>> research fellow around about the time coal beds were being laid down.
> I agree. Once you have more than a dozen (or a few dozen) cores, shared
> memory starts becoming the biggest bottleneck. Supercomputers and
> clusters are not built the way a traditional machine is built. A
> big single computer with multiple CPUs is usually a NUMA machine
> (non-uniform memory access). A cluster is composed of several nodes,
> independent computers connected via a very fast network, but the
> network connection is still slower than shared-memory communication.
> Each of the nodes in a cluster can also be a NUMA machine.
> For these kinds of machines, threads are of little use compared with
> inter-process and inter-node communication. IPC is usually done via MPI
> (Message Passing Interface) rather than shared memory.
> Threads are more useful when one needs high performance and low latency in
> an application that runs in a single computer. High performance video games
> and (soft) realtime graphics are usually in this domain.
> Best regards,
> Ronie
> 2017-01-30 23:33 GMT-03:00 Frank Shearar <frank.shearar at gmail.com>:
>> On 30 January 2017 at 17:15, Ben Coman <btc at openinworld.com> wrote:
>>> On Tue, Jan 31, 2017 at 4:19 AM, Clément Bera <bera.clement at gmail.com>
>>> wrote:
>>> >
>>> > Hi all,
>>> >
>>> > Tim's just shared this lovely article about a 10,000+ core ARM machine.
>>> With this kind of machine, it's a bit stupid to use only 1 core when you
>>> have 10,000+. I believe we have to find a way to introduce multi-threading
>>> in Squeak / Pharo. For co-processors like the Xeon Phi or graphics
>>> cards, I guess it's OK not to use them because they're not general-purpose
>>> processors while the VM is general purpose, but all those 10,000 cores...
>>> >
>>> > For parallel programming, we could consider doing something cheap like
>>> the C# parallel loops (Parallel.For and co). The Smalltalk programmer would
>>> then explicitly write "collection parallelDo: aBlock" instead of
>>> "collection do: aBlock", and if the block takes long enough to execute, the
>>> cost of parallelisation becomes negligible compared to its performance
>>> boost. The block has to perform independent tasks; if multiple blocks
>>> executed in parallel read/write the same memory location, then, as in C#,
>>> the behavior is undefined, leading to freezes / crashes.
>>> It's the programmer's responsibility to find out whether loop iterations
>>> are independent (and it's not obvious).
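The proposed "collection parallelDo: aBlock" could behave roughly like this sketch (a Python stand-in; `parallel_do` is a hypothetical helper, and a real implementation would schedule the block on VM-level threads rather than a thread pool):

```python
# Rough analogue of "collection parallelDo: aBlock": each element is
# handed to the block independently, and the caller is responsible for
# making the iterations independent of each other.
from concurrent.futures import ThreadPoolExecutor

def parallel_do(collection, block, max_workers=4):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map preserves the collection's order in the returned results
        return list(pool.map(block, collection))

squares = parallel_do(range(8), lambda n: n * n)
print(squares)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

As in Parallel.For, if the block wrote to shared state the result would be a data race; nothing in the API itself prevents that.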
>>> >
>>> > For concurrent programming, there's this design from E, where we could
>>> have an actor model in Smalltalk in which each actor is completely
>>> independent of the others, with one native thread per actor, and all the
>>> common objects (including what's necessary for look-up, such as method
>>> dictionaries) could be shared as long as they're read-only or immutable.
>>> Mutating a shared object, such as installing a method in a method
>>> dictionary, would be detected because such objects are read-only, and we
>>> can stop all the threads sharing the object in order to mutate it. The
>>> programmer has to keep mutation of shared objects uncommon to get good
>>> performance.
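The actor side of this design can be sketched in a few lines (Python for illustration; the class and method names are invented, and a real VM would give each actor its own native thread as described above):

```python
# A minimal actor: one thread, one mailbox, no shared mutable state.
# Other actors interact only by sending messages; shared data (here a
# tuple) is immutable, so reading it needs no locks.
import queue
import threading

class Actor:
    def __init__(self, handler):
        self._mailbox = queue.Queue()
        self._handler = handler
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while True:
            msg, reply = self._mailbox.get()
            if msg is None:                 # shutdown message
                break
            reply.put(self._handler(msg))   # process one message at a time

    def ask(self, msg):
        reply = queue.Queue()
        self._mailbox.put((msg, reply))
        return reply.get()

    def stop(self):
        self._mailbox.put((None, None))
        self._thread.join()

shared_config = ("read-only", "lookup", "tables")   # immutable, safely shared
doubler = Actor(lambda m: m * 2)
assert doubler.ask(21) == 42
doubler.stop()
```

Since each actor processes one message at a time, its private state never needs synchronisation; only the immutable shared objects cross actor boundaries.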
>>> >
>>> > Both designs have different goals for using multiple cores (parallel
>>> vs. concurrent programming), but in both cases we don't need to rewrite
>>> any library to make Squeak / Pharo multi-threaded the way they did in Java.
>>> >
>>> > What do you think ?
>>> >
>>> > Is there anybody on the mailing list with ideas on how to introduce
>>> threads in Squeak / Pharo in a cheap way that does not require rewriting
>>> all the core/collection libraries?
>>> >
>>> > I'm not really into multi-threading myself, but I believe the Cog VM
>>> will die within 10 years if we don't add something to support
>>> multi-threading, so I would like to hear suggestions.
>>> My naive idea is that lots might be simplified by having spawned
>>> cputhreads use a different bytecode set that enforces a functional
>>> style of programming by having no write bytecodes.  While restrictive, my
>>> inspiration is that functional languages are supposedly more suited to
>>> parallelism by having no shared state.  So all algorithms must work on
>>> the stack only
>> No: functional languages often share state. It's just that they share
>> _immutable_ state. Or if you prefer, you can't tell if two threads are
>> accessing the same data, or merely identical data.
>> For example, in Erlang, messages bigger than 64kB are shared between
>> processes on the same machine, because it's much more efficient to share a
>> pointer.
>> To make things slightly more confusing, the rule is more generally
>> "functions are APPARENTLY pure". In languages like Clojure or F#, it's
>> quite acceptable to use locally mutable state, as long as no one gets to
>> see you cheat.
>> (ML languages are capable of sharing mutable state, it's just that you
>> have to opt into such things through "ref" or "mutable" markers on things.)
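Frank's point can be shown in miniature (Python stand-in): two threads safely share one big immutable value by reference, and a reader cannot distinguish "the same object" from "an equal copy", which is exactly what makes the sharing safe.

```python
# Two threads read a large immutable value; one gets the original by
# reference, the other an equal but distinct copy. Their observations
# are identical, so the runtime is free to share the pointer.
import threading

big_message = tuple(range(100_000))          # immutable, shared by reference
copy_of_it = tuple(list(big_message))        # an equal but distinct object

results = []
def reader(data):
    results.append(sum(data))                # read-only access, no locks needed

t1 = threading.Thread(target=reader, args=(big_message,))
t2 = threading.Thread(target=reader, args=(copy_of_it,))
t1.start(); t2.start(); t1.join(); t2.join()
assert results[0] == results[1]              # observationally identical
```

This is the same optimisation Erlang applies to large messages: share the pointer, because no one can tell.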
>> frank
>>> , which may be simpler than managing multiple updaters to the
>>> objectspace.  This may(?) avoid the need to garbage-collect the 1000
>>> cputhreads, since everything gets cleared away when the stack dies with
>>> the thread.  On the flip side, we might not want to scan these 1000
>>> cputhreads when garbage-collecting the main image thread.  So these
>>> cputhreads might have a marshalling area that reference-counts object
>>> accesses external to the thread, and the garbage collector only needs
>>> to scan that area.  Or alternatively, each cputhread maintains its own
>>> objectspace that pulls in copies of objects, Spoon-style.
>>> Would each cputhread need its own method cache?  Since the application
>>> may have a massive number of individually short-lived calculations, to
>>> minimise method lookups perhaps a self-contained
>>> mini-objectspace/method-cache could be seeded/warmed up by the
>>> single-threaded main image and copied to each spawned cputhread along
>>> with the parameters passed to the first invoked function.
>>> Presumably a major use case for these multiple threads would be
>>> numeric calculations.  So perhaps you get enough bang for the buck by
>>> restricting cputhreads to operate only on immediate types?
>>> Another idea is for cputhreads to be written in Slang which is
>>> dynamically compiled and executes as native code, completely avoiding
>>> the complexity of managing multiple access to objectspace.
>>> cheers -ben