<div dir="ltr">Hi all,<div><br></div><div>Ronie, you said: </div><div><span style="font-size:12.8px"><i><br></i></span></div><div><span style="font-size:12.8px"><i>Threads are more useful when one needs high performance and low latency in an application that runs in a single computer. High performance video games and (soft) realtime graphics are usually in this domain.</i></span></div><div><br></div><div>I know you're working with high performance video games. If you were to introduce multi-threading in Squeak/Pharo, how would you do it? In particular, do you have a design in mind that does not require rewriting all the core libraries?</div><div><br></div><div>To sum up the previous mails:</div><div><br></div><div>1) There's this idea of having multiple images communicating with each other, each image on a different VM, potentially 1 native thread per image. I think there is ongoing work in this direction through multiple frameworks. With a minimal image and a minimal VM, the image+VM pair remains quite cheap: less than 15MB for the pair is already possible today. I believe this idea is great but does not entirely solve the problem.</div><div><br></div><div>2) Levente's idea is basically to share objects between images, the shared objects being read-only and lazily duplicated to worker images upon mutation, to have low-memory footprint images on the same VM. I like the idea. Instead of duplication, I was thinking of stopping threads to mutate shared objects and giving the programmer the responsibility of defining a set of shared objects that are not frequently mutated, then moving later in the direction of shared writable memory.</div><div><br></div><div>3) Ben's idea is to create a process in a new thread that cannot mutate objects in memory. 
I have issues with this design because each worker thread, as you say, has to work only with the stack; hence it cannot allocate objects and therefore cannot use closures.</div><div><br></div><div>4) I need to look into the RoarVM project again and Dave Ungar's work on multithreaded Smalltalk. I guess I should contact Stefan Marr again.</div><div><br></div><div>5) I didn't mention it earlier, but there's Eliot's work on ThreadedFFI to use multiple native threads when using FFI. It also solves part of the multi-threading problem.</div><div><br></div><div>Thanks for sharing ideas.</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Jan 31, 2017 at 4:25 AM, Ronie Salgado <span dir="ltr"><<a href="mailto:roniesalg@gmail.com" target="_blank">roniesalg@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> <br><div dir="ltr"><div><div><div><div><div>Hi all,<br> <br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">It’s 35+ years ago but my last experience in very parallel systems left 

me convinced that the first thing you do is prioritise the inter-process

 communication and leave the ‘real work’ as something to do in the 

machine’s spare time. I had a Meiko Transputer Computing Surface when I 

was an IBM research fellow around about the time coal beds were being 

laid down.<br>

</blockquote><div>I agree. With more than a dozen (or a few dozen) cores, shared memory starts to become the biggest bottleneck. Supercomputers and clusters are not built the way a traditional machine is built. A big single computer with multiple CPUs is usually a NUMA (non-uniform memory access) machine. A cluster is composed of several nodes, which are independent computers connected via a very fast network, but the network connection is still slower than shared-memory communication. Each node in a cluster could also be a NUMA machine.</div><br></div>For this kind of machine, threads are useless in comparison with inter-process and inter-node communication. IPC is usually done via MPI (Message Passing Interface) instead of shared memory.<br><br></div>Threads are more useful when one needs high performance and low latency in an application that runs in a single computer. High performance video games and (soft) realtime graphics are usually in this domain.<br><br></div>Best regards,<br></div>Ronie<br></div><div class="gmail_extra"><br><div class="gmail_quote">2017-01-30 23:33 GMT-03:00 Frank Shearar <span dir="ltr"><<a href="mailto:frank.shearar@gmail.com" target="_blank">frank.shearar@gmail.com</a>></span>:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> <br><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On 30 January 2017 at 17:15, Ben Coman <span dir="ltr"><<a href="mailto:btc@openinworld.com" target="_blank">btc@openinworld.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><span class="m_6018324803854768637m_3399150283529860110gmail-"><br>

On Tue, Jan 31, 2017 at 4:19 AM, Clément Bera <<a href="mailto:bera.clement@gmail.com" target="_blank">bera.clement@gmail.com</a>> wrote:<br>

><br>

> Hi all,<br>

><br>

> Tim's just shared this lovely article about a 10,000+ core ARM machine. With this kind of machine, it's a bit stupid to use only 1 core when you have 10,000+. I believe we have to find a way to introduce multi-threading in Squeak / Pharo. For co-processors like the Xeon Phi or graphics cards, I guess it's ok not to use them because they're not general-purpose processors while the VM is general purpose, but all those 10,000 cores...<br>

><br>

> For parallel programming, we could consider doing something cheap like the parallel C# loops (Parallel.For and co). The Smalltalk programmer would then explicitly write "collection parallelDo: aBlock" instead of "collection do: aBlock", and if the block takes long enough to execute, the cost of parallelisation becomes negligible compared to the performance boost it brings. The block has to perform independent tasks, and if multiple blocks executed in parallel read/write the same memory location, as in C#, the behavior is undefined, leading to freezes / crashes. It's the responsibility of the programmer to find out whether loop iterations are independent or not (and it's not obvious).<br>
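><br>
> A minimal sketch of what such an explicit parallel loop could look like, written in Python rather than Smalltalk so that it is easy to run. The names parallel_do and workers are assumptions made up for this illustration, not an existing Squeak / Pharo or C# API. CPython threads do not execute bytecode in parallel because of its global interpreter lock, so this only shows the shape of the API and the independence requirement, not a speed-up.<br>
<pre>
```python
from concurrent.futures import ThreadPoolExecutor


def parallel_do(collection, block, workers=4):
    """Hypothetical analogue of `collection parallelDo: aBlock`.

    Runs block on every element using a pool of worker threads and
    returns only once all iterations have finished.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Consuming the map iterator waits for every task and re-raises
        # the first exception raised by any iteration.
        for _ in pool.map(block, collection):
            pass


# Independent iterations: each one writes to its own slot only.
inputs = list(range(8))
results = [None] * len(inputs)


def square_into_slot(i):
    results[i] = inputs[i] * inputs[i]


parallel_do(range(len(inputs)), square_into_slot)
print(results)
```
</pre>
> Each iteration writes only to its own slot of results, which is what makes the iterations independent; a block that updated one shared counter would need synchronisation.<br>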

><br>

> For concurrent programming, there's this design from E where we could have an actor model in Smalltalk where each actor is completely independent of the others, one native thread per actor, and all the common objects (including what's necessary for look-up such as method dictionaries) could be shared as long as they're read-only or immutable. Mutating a shared object, such as installing a method in a method dictionary, would be detected because such objects are read-only, and we can stop all the threads sharing such an object to mutate it. The programmer has to keep mutation of shared objects uncommon to have good performance.<br>
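><br>
> One way to picture the "stop the sharers, then mutate" rule, sketched in Python under stated assumptions: SharedReadMostly, read and mutate are hypothetical names, and the readers-draining lock below is a simplification standing in for a VM actually pausing native threads. Reads are cheap and concurrent; a mutation waits until no reader is active and holds new readers back until it is done.<br>
<pre>
```python
import threading


class SharedReadMostly:
    """Hypothetical shared object: many readers, rare stop-the-world writes."""

    def __init__(self, value):
        self._value = value
        self._cond = threading.Condition()
        self._active_readers = 0
        self._writer_pending = False

    def read(self, block):
        with self._cond:
            # New readers are held back while a mutation is pending.
            while self._writer_pending:
                self._cond.wait()
            self._active_readers += 1
        try:
            return block(self._value)
        finally:
            with self._cond:
                self._active_readers -= 1
                self._cond.notify_all()

    def mutate(self, block):
        with self._cond:
            # Only one mutation at a time.
            while self._writer_pending:
                self._cond.wait()
            self._writer_pending = True
            # "Stop the world": wait until every active reader has left.
            while self._active_readers:
                self._cond.wait()
            try:
                self._value = block(self._value)
            finally:
                self._writer_pending = False
                self._cond.notify_all()


# A method dictionary shared by several lookup threads.
method_dict = SharedReadMostly({"printOn:": "v1"})


def lookup_many():
    for _ in range(1000):
        method_dict.read(lambda d: d["printOn:"])


readers = [threading.Thread(target=lookup_many) for _ in range(4)]
for t in readers:
    t.start()
# Installing a new method is the rare, expensive operation.
method_dict.mutate(lambda d: {**d, "printOn:": "v2"})
for t in readers:
    t.join()
print(method_dict.read(lambda d: d["printOn:"]))
```
</pre>
> The cost model matches the text above: every mutation stalls all readers, so performance is only good if mutations stay rare. A block passed to read must not call mutate on the same object, or it would wait for itself.<br>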

><br>

> Both designs have different goals in using multiple cores (parallel and concurrent programming), but in both cases we don't need to rewrite any library to make Squeak / Pharo multi-threaded like they did in Java.<br>

><br>

> What do you think ?<br>

><br>

> Does anybody on the mailing list have ideas on how to introduce threads in Squeak / Pharo in a cheap way that does not require rewriting all core/collection libraries?<br>

><br>

> I'm not really into multi-threading myself but I believe the Cog VM will die in 10 years from now if we don't add something to support multi-threading, so I would like to hear suggestions.<br>

<br>

</span>My naive idea is that lots might be simplified by having spawned<br>

cputhreads use a different bytecode set that enforces a functional<br>

style of programming by having no write codes.  While restrictive, my<br>

inspiration is that functional languages are supposedly more suited to<br>

parallelism by having no shared state.  So all algorithms must work on<br>

the stack only</blockquote><div><br></div><div><div>No: functional languages often share state. It's just that they share _immutable_ state. Or if you prefer, you can't tell if two threads are accessing the same data, or merely identical data.</div><div><br></div><div>For example, in Erlang, binaries larger than 64 bytes are shared between processes on the same node, because it's much more efficient to share a pointer.</div><div><br></div><div>To make things slightly more confusing, the rule is more generally "functions are APPARENTLY pure". In languages like Clojure or F#, it's quite acceptable to use locally mutable state, as long as no one gets to see you cheat.</div></div><div><br></div><div>(ML languages are capable of sharing mutable state; it's just that you have to opt into such things through "ref" or "mutable" markers.)</div><div><br></div><div>frank</div><div> <br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">, which may be simpler than managing multiple updaters to<br>

objectspace.  This may(?) avoid the need to garbage collect the 1000<br>

cputhreads since everything gets cleared away when the stack dies with<br>

the thread.  On the flip side, might not want to scan these 1000<br>

cputhreads when garbage collecting the main Image thread.  So these<br>

cputhreads might have a marshaling area that reference counts object<br>

accesses external to the thread, and the garbage collector only needs<br>

to scan that area.  Or alternatively, each cputhread maintains its own<br>

objectspace that pulls in copies of objects Spoon style.<br>
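<br>
The last alternative can be sketched roughly as follows, in Python. This is only an illustration under assumptions: spawn_isolated is a made-up helper, and it copies objects eagerly when the worker is spawned, whereas a Spoon-style objectspace would pull copies in lazily on demand.<br>
<pre>
```python
import copy
from concurrent.futures import ThreadPoolExecutor


def spawn_isolated(pool, task, *objects):
    """Hypothetical helper: give the worker its own private copies.

    The copies are made by the spawning thread, so the worker never
    holds a reference into the main image's object graph.
    """
    private = [copy.deepcopy(obj) for obj in objects]
    return pool.submit(task, *private)


main_image_data = {"samples": [3, 1, 2]}


def sort_and_sum(data):
    data["samples"].sort()  # mutates only the worker's private copy
    return sum(data["samples"])


with ThreadPoolExecutor(max_workers=2) as pool:
    total = spawn_isolated(pool, sort_and_sum, main_image_data).result()

print(total, main_image_data["samples"])
```
</pre>
Only the returned value crosses back to the main thread, so the main image's collector has nothing inside the worker to scan; the price is the copying done up front.<br>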

<br>

Would each cputhread need its own method cache?  Since the application<br>

may have a massive number of individually short lived calculations, to<br>

minimise method lookups perhaps a self-contained<br>

mini-objectspace/method-cache could be seeded/warmed-up by the single<br>

threaded main image, which is copied to each spawned cputhread with<br>

parameters passed to the first invoked function.<br>

<br>

Presumably a major use case for these multiple threads would be<br>

numeric calculations.  So perhaps you get enough bang for the buck by<br>

restricting cputhreads to operate only on immediate types?<br>

<br>

Another idea is for cputhreads to be written in Slang which is<br>

dynamically compiled and executes as native code, completely avoiding<br>

the complexity of managing multiple access to objectspace.<br>

<br>

cheers -ben<br>

</blockquote></div><br></div></div>

<br></blockquote></div><br></div>

<br></blockquote></div><br></div>