Multi-core CPUs

Thu Oct 18 17:46:09 UTC 2007

On Oct 18, 2007, at 10:01 AM, Joshua Gargus wrote:

>
> On Oct 18, 2007, at 9:06 AM, Robert Withers wrote:
>
>>
>> On Oct 17, 2007, at 11:12 PM, Hans-Martin Mosner wrote:
>>
>>> This would only make things more complicated since then the primitives
>>> would have to start parallel native threads working on the same object
>>> memory.
>>> The problem with native threads is that the current object memory is not
>>> designed to work with multiple independent mutator threads. There are GC
>>> algorithms which work with parallel threads, but AFAIK they all have
>>> quite some overhead relative to the single-thread situation.
>>>
>>> IMO, a combination of native threads and green threads would be the best
>>> (although it still has the problem of parallel GC):
>>> The VM runs a small fixed number of native threads (default: number of
>>> available cores, but could be a little more to efficiently handle
>>> blocking calls to external functions) which compete for the runnable
>>> Smalltalk processes. That way, a number of processes could be active at
>>> any one time instead of just one. The synchronization overhead in the
>>> process-switching primitives should be negligible compared to the
>>> overhead needed for GC synchronization.
>>
>> This is exactly what I have started work on.  I want to use the  
>> foundations of SqueakElib as a msg passing mechanism between  
>> objects assigned to different native threads.  There would be one  
>> native thread per core.  I am currently trying to understand what  
>> to do with all of the global variables used in the interp loop, so  
>> I can have multiple threads running that code.  I have given very  
>> little thought to what would need to be protected in the object  
>> memory or in the primitives.  I take this very much as a learning  
>> project.  Just think, I'll be able to see how the interpreter  
>> works, the object memory, bytecode dispatch, primitives....all of  
>> it in fact.  If I can come out with a working system that does msg  
>> passing, even at the cost of poorly performing object memory, et  
>> al., then it will be a major success for me.
>>
>> It is going to be slower, anyway, because I have to intercept each  
>> msg send as a possible non-local send.
>
> Isn't this a show-stopper for a practical system?

Probably.  Although, if a single thread executes code slower than
current Squeak, yet all threads together generate higher throughput,
then the system as a whole should be considered faster.
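
As a rough illustration of that trade-off, here is a back-of-the-envelope
sketch in Python. It assumes ideal scaling with no GC or synchronization
cost, which a real VM would not achieve; the function name and the numbers
are made up for illustration.

```python
def aggregate_speedup(threads, per_thread_slowdown):
    """Throughput of `threads` interpreters relative to one unmodified VM.

    per_thread_slowdown is how many times slower each thread runs because
    every send must be checked for a non-local receiver (hypothetical figure).
    Assumes perfect scaling: no contention, no GC pauses.
    """
    return threads / per_thread_slowdown


# Under this idealized model, the threaded VM only wins when the number
# of threads exceeds the per-thread slowdown factor.
for cores in (2, 4, 8):
    print(cores, "cores:", aggregate_speedup(cores, 3.0))
```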

> Or is this a stepping-stone?

It is a stepping-stone to see what inter-thread messaging looks like
and how it behaves.

> If so, how do you envision resolving this in the future?

My thinking is that getting the messaging working is the first step,  
followed by looking at synchronization problems, and then looking at  
what things like Exupery may offer to speed things up.

The example I gave of MacroTransforms is telling.  Currently an
#ifTrue: message is macro transformed into bytecodes that do the
#ifTrue: inline.  I have had to back that out so the #ifTrue: can be
intercepted if the receiver is non-local.  At runtime, it would be
nice to detect that the receiver is in fact local and use some form
of inlining, and otherwise intercept.  Since this means selecting
bytecodes at runtime, I thought of Exupery.
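
The runtime check described here can be sketched roughly as follows. This
is Python rather than Smalltalk, and Vat, Owned, and if_true are
hypothetical names used only for illustration; they are not SqueakElib or
VM code, and a real system would return a promise for the non-local case.

```python
import queue


class Vat:
    """Hypothetical per-thread owner of objects, with an inbox for remote sends."""

    def __init__(self, name):
        self.name = name
        self.inbox = queue.Queue()


class Owned:
    """Hypothetical wrapper recording which vat owns a receiver."""

    def __init__(self, value, vat):
        self.value = value
        self.vat = vat


def if_true(receiver, block, current_vat):
    """Send #ifTrue: to receiver: inline when local, intercept otherwise."""
    if receiver.vat is current_vat:
        # Local receiver: take the fast path the macro transform used to give.
        return block() if receiver.value else None
    # Non-local receiver: queue the send for the owning vat instead of
    # evaluating it on this thread.
    receiver.vat.inbox.put(("ifTrue:", receiver, block))
    return None  # placeholder; a promise would go here in a real design
```

The point of the sketch is only that the locality test is a cheap identity
comparison, so the slow path is paid only when the receiver lives elsewhere.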

I think there could be lots of interesting optimization work if the
basic system is functional.

>
> FWIW, Croquet was at one time envisioned to work in the way that  
> you describe.  The architects weren't able to produce a  
> satisfactory design/implementation within the necessary time frame,  
> and instead developed the current "Islands" mechanism.  This has  
> worked out very well in practice, and there is no pressing need to  
> try to implement the original idea.

I didn't know that, that's cool.  Islands is neat.

>
> In my understanding, Croquet islands and E vats are quite similar  
> in that regard (and the latter informed the design of the  
> former)... both use an explicit far-ref proxy to an object in  
> another island/vat.  What is the motivation for the approach you  
> have chosen, other than it being a fun learning process (which may  
> certainly be a good enough reason on its own)?

As I described above, maybe it's a stepping-stone.  Having a
thread-based vat means there are resolved refs such as NearRef (same
thread), ThreadRef (same process/mem, different thread), possibly
ProcessRef (different process, uses pipes), and FarRef (on the net).
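
One way to picture how a resolved ref might be chosen from those four
kinds is the sketch below. It is Python for brevity; the Location record
and ref_kind function are assumptions made for illustration, and only the
four ref names come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    """Hypothetical address of an object: which host, OS process, and thread."""
    host: str
    process: int
    thread: int


def ref_kind(here, there):
    """Pick the cheapest ref kind that can reach `there` from `here`."""
    if here.host != there.host:
        return "FarRef"      # on the net
    if here.process != there.process:
        return "ProcessRef"  # different process, e.g. over pipes
    if here.thread != there.thread:
        return "ThreadRef"   # same process/mem, different thread
    return "NearRef"         # same thread


me = Location("alpha", 100, 1)
print(ref_kind(me, Location("alpha", 100, 2)))
```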

I'm not very experienced with the vm/object memory, so this is also a  
fun learning experience!

Cheers,
Rob

>
> Cheers,
> Josh
>
>
>> To this end, the Macro Transforms had to be disabled so I could  
>> intercept them.  The system slowed considerably.  I hope to speed  
>> them up with runtime info: is the receiver in the same thread  
>> that's running?
>>
>> I do appreciate your comments and know that I may be wasting my  
>> time.  :)
>>
>>>
>>> The simple yet efficient ObjectMemory of current Squeak cannot be used
>>> with parallel threads (at least not without significant synchronization
>>> overhead). AFAIK, efficient algorithms require every thread to have its
>>> own object allocation area to avoid contention on object allocations.
>>> Tenuring (making young objects old) and storing new objects into old
>>> objects (remembered table) require synchronization. In other words,
>>> grafting a threadsafe object memory onto Squeak would be a major project.
>>>
>>> In contrast, for a significant subset of applications (servers) it is
>>> orders of magnitude simpler to run several images in parallel. Those
>>> images don't stomp on each other's object memory, so there is absolutely
>>> no synchronization overhead. For stateful sessions, a front end can
>>> handle routing requests to the image which currently holds a session's
>>> state; stateless requests can be handled by any image.
>>>
>>> Cheers,
>>> Hans-Martin
>>>
>>
>>
>
>