Multi-core CPUs

Peter William Lount peter at smalltalk.org
Thu Oct 25 21:08:04 UTC 2007


Hi,


Sebastian Sastre wrote:
 

    Hi,

    Sebastian Sastre wrote:
     
>     hi,
>
>     What? That just won't work. Think of the memory overhead. 
>      
>     I don't give credit to unfounded apriorisms. I think it deserves
>     to be proven that it does not work. Anyway, let's just assume it
>     may be too much for state-of-the-art hardware in common computers
>     in 2007. What about in 2009? What about in 2012? Remember the
>     attitude you are showing now when the first day of 2012 arrives.

    It's not an unfounded apriorism as you put it.

    Current hardware and technology expected in the next ten years isn't
    optimized for N hundred thousand or N million threads of execution.
    Maybe in the future that will be the case.

    The Tile-64 processor is expected to grow to about 4096 processors
    by pushing the limits of technology beyond where they are today.
    Reaching the levels you are talking about for a current Smalltalk
    image, with millions of objects each having their own thread (or
    process), isn't going to happen anytime soon.

    I work with real hardware.

    I am open and willing to be pleasantly surprised however. 

    Peter, Peter, you have to fight that demon a little harder. Look,
    I asked you to read my previous post with subject "One Process
    Per Instance", where I took the time (and so money) to explain, as
    didactically as I could, how *your* million-object example could be
    managed in a system like the one I'm speculating about. So please,
    please, Peter, I ask you not to make me repeat myself: go read it
    and make your statements there if you find problems. As I already
    said, I think the experiences you are sharing in this matter are
    precious, so the discussion just gets richer.

I recall a counter-example to the million objects: split the data 
objects into 10,000 chunks. However, that's a different problem, not 
the one that I have to deal with.


>     Tying an object instance to a particular process makes no sense.
>     If you did that you'd likely end up with just as many deadlocks
>     and other concurrency problems, since you'd now have message sends
>     to the object being queued up on the process's input queue. Since
>     processes can only process one message at a time, deadlocks can
>     occur - plus all kinds of nasty problems resulting from the order
>     of messages in the queue. (There is a similar nasty problem with
>     the GUI event processing in VisualAge Smalltalk that leads to very
>     difficult-to-diagnose concurrency problems.) It's a rat's maze
>     that's best avoided.
>      
>     Besides, in some cases an object with multiple threads could
>     respond to many messages - literally - at the same time given
>     multiple cores. Why slow down the system by putting all the
>     messages into a single queue when you don't have to!?
>     You didn't understand the model I'm talking about.

    That is likely the case.  
     
    So I kindly ask you to read my previous emails, where I have taken
    the job of expressing my exploratory thoughts until I reached this
    model, and the speculation about the consequences of this model's
    existence.

There are so many emails in this thread; please link to the emails 
you'd like me to reread. Thanks very much.


>     There isn't such a thing as an object with multiple threads. That
>     does not exist in this model.

    Ok. I got that.


>     There exists one process per instance, no more, no less.

    I did get that. Even if you only do that logically you've got
    serious problems. 
     
    If you have read where I talk about how to manage N million objects
    with limited hardware under this model and you still find problems,
    please be my guest and report them here, because I want to know
    about them as soon as possible.

Yes, I know it's possible for systems like Erlang to have 100,000 
virtual processes, aka lightweight threads, running in one real native 
operating system process or spread across many native processes.

Are you saying that you've figured out how to do that with millions of 
processes?
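Since this thread has no code, here is a sketch in Python (not Erlang or Smalltalk) of the lightweight-process idea: many cooperative tasks, each with its own mailbox, all inside a single OS process. The task body and message counts are illustrative assumptions, not anything from Erlang itself.

```python
import asyncio

# Each "lightweight process" is a cooperative task blocking on its own
# mailbox; none of them is a real OS thread.
async def tiny_process(mailbox, results):
    msg = await mailbox.get()          # wait for one message
    results.append(msg * 2)            # do a trivial bit of work

async def main(n=10_000):
    results = []
    mailboxes = [asyncio.Queue() for _ in range(n)]
    tasks = [asyncio.create_task(tiny_process(q, results))
             for q in mailboxes]
    for i, q in enumerate(mailboxes):
        q.put_nowait(i)                # deliver one message per mailbox
    await asyncio.gather(*tasks)       # all n tasks run to completion
    return len(results)

print(asyncio.run(main()))             # 10000 -- all in one OS process
```

Ten thousand such tasks are cheap because a task is just a heap object plus a small coroutine frame, not a kernel thread with its own stack pages.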


>     I think you're thinking about processes and threads the same way
>     you know them today.

    I can easily see such a scenario working and also breaking all over
    the place. 
     
    Why? 


>     Let's see if this helps you get the idea. Disambiguation: in this
>     model I'm talking about a process not as an OS process but as a
>     lightweight VM process, which we also call a thread.

    Ok.


>     So I'm saying that in this model you have only one process per
>     instance but that process is not a process that can have threads
>     belonging to it.

    ok.

>     That generates a hell of complexity.

    You lost me there. What complexity?
     
    It does not matter; that is another model, not the one I'm
    speculating about (probably one you imagined before I clarified
    the 1:1 object-process thing).

Alan Kay's original work suggested that each object had a process. There 
is the logical view and the idealized view. Then there is the concrete: 
how to implement things. How one explains things to end users is often 
with the idealized view. How one implements them is more often with 
something that isn't quite the ideal.


>     The process I'm saying is tied to an instance is closer to the
>     dictionary meaning of the word "process", plus what you know an
>     instance is, with the process implemented by a VM that can
>     balance it across cores.

    I didn't understand. Please restate.  
     
    I restated that N times in my previous emails, at too great a
    length. To give you a clue, it's about the double nature I'm saying
    the object has: an amalgam of object and process, a conceptual
    indissociability. More in those previous emails.

Alright, I'll have to reread the entire thread, since no one wants to 
clearly state their point of view in one email as I attempt to do out 
of courtesy. I don't have time to reread the entire thread today, 
though. (That's why it is a courtesy to repost - it saves your readers 
time.)


>
>     I'm not falling into the pitfall of trying to parallelize code
>     automagically. This is far from it. In fact I think this is better
>     than that illusion. Every message is guaranteed by the VM to reach
>     its destination in guaranteed order. Otherwise there will be
>     chaos. And we want an ordered chaos like the one we have now in a
>     reified Squeak image.

    Yes, Squeak is ordered chaos. ;--)



>      
>     Having clarified that, I ask: why do you think there could be
>     deadlocks? And what other kinds of concurrency problems do you
>     think this model will suffer?


    If a number of messages are waiting in the input queue of a process
    that can only process one message at a time, since it's not
    multi-threaded, then those messages are BLOCKED while in the queue.
    Now imagine another process with messages in its queue that are
    also BLOCKED, since they are waiting in the queue and only one
    message can be processed at a time. Now imagine that process A and
    process B each have messages that the other needs before it can
    proceed, but those messages are BLOCKED waiting for processing in
    the queues.

    This is a real example of what can happen with message queues. The
    order isn't guaranteed. Simple concurrency solutions often have
    deadlock scenarios. This can occur when objects must synchronize
    events or information. As soon as you have multiple threads of
    execution you've got problems that need solving regardless of the
    concurrency model in place. 

     But that can happen right now if you make bad use of processes in
    a current Smalltalk. I don't want to solve deadlocks for anybody
    using parallelism badly. I just want a Smalltalk that works like
    today's but balances CPU load across cores and scales to an
    arbitrary number of them. This whole thread is about that.
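The two-queue scenario described above can be sketched in Python (again as an illustration, since the thread has no code). Two single-threaded "processes" each wait for a message the other will only send after being answered first; short timeouts stand in for what would otherwise be an infinite hang.

```python
import queue
import threading

def actor(name, my_inbox, peer_inbox, log):
    try:
        # Wait for the peer's request before replying -- but the peer
        # is doing exactly the same, so neither message ever arrives.
        msg = my_inbox.get(timeout=0.2)
        peer_inbox.put("reply to " + msg)
        log.append(name + " proceeded")
    except queue.Empty:
        log.append(name + " deadlocked")

log, inbox_a, inbox_b = [], queue.Queue(), queue.Queue()
ta = threading.Thread(target=actor, args=("A", inbox_a, inbox_b, log))
tb = threading.Thread(target=actor, args=("B", inbox_b, inbox_a, log))
ta.start(); tb.start(); ta.join(); tb.join()
print(sorted(log))          # ['A deadlocked', 'B deadlocked']
```

Nothing here is mis-ordered or lost; the deadlock arises purely from each single-threaded process blocking on one inbox while the message it needs sits unprocessed in the other.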


Yes it can happen now. That's why it's important to actually learn 
concurrency control techniques. Books like the Little Book of Semaphores 
can help with that learning process.

The point is that a number of people in this thread are proposing 
solutions that seem to claim that these problems magically go away in 
some utopian manner with process-based concurrency. All I'm pointing out 
is that there isn't a silver bullet or concurrency utopia, and now I'm 
getting flak for pointing out that non-ignorable reality. So be it. 
Those that push ahead are often the ones with many arrows in their back.


>
>     Tying an object's life time to the lifetime of a process doesn't
>     make sense since there could be references to the object all over
>     the place. If the process quits the object should still be alive
>     IF there are still references to it.
>     You'd need to pass around more than references to processes. For
>     if a process has more than one object you'd not get the resolution
>     you'd need. No, passing object references around is way better.
>      
>     Yes, of course there will be. In this system a process termination
>     is one of two things: A) that instance is being reclaimed in a
>     garbage collection, or B) that instance has been written to disk
>     in a kind of hibernation that can be reified again on demand.
>     Please refer to my previous post with subject "One Process Per
>     Instance.." where I talk more about exactly this.

    If all there is is one object per process and one process per
    object - a 1-to-1 mapping - then yes, GC would work that way, but
    the 1-to-1 mapping isn't likely to ever happen given current and
    future hardware prospects. 

     But Peter, don't lower your guard on that so easily! We know
    techniques to administer resources, like navigating ten thousand
    instances at a time in a 10-gigabyte image of 10 million objects!
    Don't shoot hope before it is born! I discuss some details I've
    imagined about this in my "One Process Per Instance" post.


Well then you must have a radically different meaning of 1-to-1 object 
to process mapping than I have, or a radically different implementation 
than I've understood from your writings. If you can make it work, that 
is all the proof you need, isn't it!?


>
>     Even if you considered an object as having its own "logical"
>     process you'd get into the queuing problems hinted at above.
>      
>     Which I don't see, and I ask for your help to understand them if
>     you still find them after the clarifications made about the model.

    See the example above.


>
>     Besides, objects in Smalltalk are really fine-grained. The notion
>     that each object would have its own thread would require so much
>     thread switching that no current processor could handle it. It
>     would also be a huge waste of resources.
>     And what do you think was coming out of the mouths of critics of
>     initiatives like the one the Xerox PARC team had in the 1970s,
>     making a Smalltalk at the CPU and RAM prices of that time? That
>     VMs are a smart, efficient use of resources?

    That's not really relevant. If you want to build that please go
    ahead - please don't let me stop you, that's the last thing I'd
    want. I wish you luck. I get to play with current hardware and
    hardware that's coming down the pipe such as the Tile-64 or the
    newest GPUs when they are available to the wider market.
     
    We all have to use cheap hardware. Please (re)think about what I
    said about administering hardware resources over this model.






>      
>     So I copy-paste myself: "I don't give credit to unfounded
>     apriorisms. It deserves to be proven that it does not work.
>     Anyway, let's just assume it may be too much for state-of-the-art
>     hardware in common computers in 2007. What about in 2009? What
>     about in 2012?"

    Well, just get out your calculator. There is an overhead to a
    thread or process in bytes. Say 512 bytes per thread, plus its
    stack. There is the number of objects. Say 1 million for a medium
    to small image. Now multiply those and you get 1/2 gigabyte. Oh, we
    forgot the stack space and the memory for the objects themselves.
    Add a multiplier for that, say 8, and you get 4 gigabytes. Oh wait,
    we forgot that the stack is kind of weird, since each message send
    that isn't to self must be an interprocess or interthread message
    send, so you've got some weirdness going on, let alone all the
    thread context switching for each message send that occurs. Then
    you've got to add more for who knows what... the list could go on
    for quite a bit. It's just mind boggling. 
     
    I just can't believe we really can't find clever ways of
    administering resources to the point at which this becomes
    acceptable.
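The back-of-envelope figures above can be checked directly; a small Python sketch, where every number is the thread's own assumption rather than a measurement:

```python
# Assumed figures from the discussion above, not measurements.
objects = 1_000_000          # "1 million for a medium to small image"
per_thread = 512             # bytes of control block per thread, no stack
control_blocks = objects * per_thread
print(control_blocks / 1e9)  # 0.512 -- roughly half a gigabyte

multiplier = 8               # rough factor for stacks plus the objects
total = control_blocks * multiplier
print(total / 1e9)           # 4.096 -- roughly four gigabytes
```

So even before counting per-send context switches, the 1-to-1 scheme costs gigabytes of bookkeeping on these assumptions.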


Each thread needs a stack and a stored set of registers. That's at 
least two four-kilobyte memory pages (one for the stack and one for the 
registers) with current hardware, assuming your thread is mapped to a 
real processor thread of execution at some point. The two pages are 
there so that the processor can detect if the stack grows beyond its 
four-kilobyte page. Now, you could pack them into one page when the 
thread isn't executing, but that would increase your context-switch 
time to pack and unpack. If you avoid that and simply use the same page 
for both, then you risk having your stack overwrite memory used for the 
process/thread, which would be unsafe multi-threading.

Maybe some hardware designers will figure it out.

However, there is still the worst pitfall of the 1-to-1 mapping of 
process to object: the overhead of each message send to another object 
would require a thread context switch! That is inescapably huge.
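To make that pitfall concrete, here is a minimal Python sketch of a "process-per-object" counter (a hypothetical example, not anyone's proposed implementation): every message send that is not to self must cross into the object's own thread via its inbox, which is exactly the per-send context switch discussed above.

```python
import queue
import threading

class CounterProcess:
    """An object that owns one thread and answers messages via a queue."""

    def __init__(self):
        self.value = 0
        self.inbox = queue.Queue()
        threading.Thread(target=self._loop, daemon=True).start()

    def _loop(self):
        while True:
            selector, args, reply = self.inbox.get()
            reply.put(getattr(self, selector)(*args))  # run, then answer

    def increment(self, n):        # runs inside the object's own thread
        self.value += n
        return self.value

    def send(self, selector, *args):
        # A message send from outside: enqueue, context-switch into the
        # object's thread, then block waiting for the answer.
        reply = queue.Queue()
        self.inbox.put((selector, args, reply))
        return reply.get()

c = CounterProcess()
for _ in range(3):
    c.send("increment", 1)
print(c.send("increment", 0))      # 3
```

Each `send` here costs two queue operations and at least one thread switch where a plain method call would cost a few instructions - which is why a literal 1-to-1 mapping is so expensive per message.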



    Simply put, current CPU architectures are not designed for that
    approach. Heck, they are even highly incompatible with dynamic
    message passing, since they favor static code in terms of
    optimizations. 

     Yes, that happens with machines based on mathematical models like
    the Boolean model. It injects an impedance mismatch between the
    conceptual modeling and the virtual modeling.

>
>     Again, one solution does not fit all problems - if it did
>     programming would be easier.
>      
>     But programming should be easier.
    Yes, I concur, whenever it's possible to do so. But we shouldn't
    ignore the hard problems either.


>     Smalltalk made it easier in a lot of aspects.

    Sure, I concur. That's why I am working here in this group,
    spending time (which is money) on these emails.

>     Listen, I'm not a naive silver-bullet purchaser nor a faithful
>     person. I'm a critical Smalltalker who thinks he gets the point
>     of OOP and tries to find solutions to surpass the multicore
>     crisis by getting an empowered system, not consoling himself with
>     a weaker one.

    I do get that about you.


>     Peter please try to forget about how systems are made and think in
>     how you want to make them.

    I do think about how I want to make them. However to make them I
    have no choice but to consider how to actually build them using
    existing technologies and the coming future technologies.

    Currently we have 2-core and 4-core processors as the mainstream,
    with 3-core and 8-core coming to a computer store near you. We have
    the current crop of GPUs from NVidia with 128 processing units that
    can be programmed in a variant of C for some general-purpose
    programming tasks using a SIMD (single instruction, multiple data)
    format - very useful for number-crunching applications like
    graphics, cryptology and numeric analysis, to name just a few. We
    also have the general-purpose networked Tile-64 coming - lots of
    general-purpose compute power with an equal amount of scalable
    networked IO power - very impressive. Intel even has a prototype
    with 80 cores that is similar. Intel also has its awesomely
    impressive Itanium processor with instruction-level parallelism as
    well as multiple cores - just wait till that's a 128-core beastie.
    Plus there is hardware that we likely don't know about or that
    hasn't been invented yet. Please bring it on!!!

    The bigger problem is that in order to build real systems I need to
    think about how they are constructed.

    So yes, I want easy parallel computing, but it's just a harsh
    reality that concurrency, synchronization, distributed processing,
    and other advanced topics are not always easy, or possible to
    simplify as much as we want them to be. That is the nature of
    computers.

    Sorry for being a visionary-realist. Sorry if I've sounded like the
    critic. I don't mean to be the critic that kills your dreams - if
    I've done that I apologize. I've simply meant to be the realist who
    informs the visionary that certain adjustments are needed.

    All the best,

    Peter

     Please take your time to think about what I've stated about
    administering resources: that it is possible to manage a load of
    millions of instances with a swarm of a few at a time. And don't be
    sorry about anything. I love criticism. Our culture needs tons of
    criticism to be stronger. It's the only way we can uninstall
    deprecated or obsolete ideas. You are helping here. If I am really
    dreaming and this doesn't work, I want that dream to be killed now
    so I can spend my time on something better. That helps.


Well, having loads of millions of instances managed by a swarm of them 
at once is what I was assuming. In fact Linux does this (well, for 
thousands, not millions). It turns out that Intel's X86/IA32 
architecture can only handle 4096 threads in hardware. What Linux did 
was virtualize them, so that only one hardware thread was used for the 
active thread (per core, I would assume). This allowed Linux to avoid 
the glass ceiling of 4096 threads. However, there are limits due to the 
overhead of context-switching time and the overhead of space that each 
thread - even with a minimal stack, as would be the case with the model 
you are proposing - might have. It's just too onerous for practical use.

Unless you are doing something radically different that I don't 
understand that is.

     By now this model is just getting stronger. Please try to take it
    down!!!   :)))


I thought I crushed it already!!! ;--)

Certainly, until you can provide a means for it to handle the one 
million data objects across 10,000 processes, with edits going to the 
10,000 processes, plus partial object-graph seeding (and any object on 
demand) to them, and end up with one and a half million output objects 
with the total number of interconnections increased by 70%, I'll 
consider it crushed. ;--)

Forward to the future - to infinity and beyond with real hardware!

Cheers,

Peter






