[squeak-dev] Re: KedamaGPU, etc. (was: "OpenCL bindings for Squeak")

Josh Gargus josh at schwa.ca
Sun Feb 21 21:31:00 UTC 2010


On Feb 21, 2010, at 11:52 AM, Yoshiki Ohshima wrote:

> At Sat, 20 Feb 2010 16:54:19 -0800,
> Josh Gargus wrote:
>> 
>> On Feb 20, 2010, at 2:48 PM, Yoshiki Ohshima wrote:
>> 
>>> At Sat, 20 Feb 2010 14:03:37 -0800,
>>> Josh Gargus wrote:
>>>> 
>>>> While I was hacking away on my OpenCL bindings, I was thinking about what kind of small, fun demos I could include.  When I was first exposed to Squeak, one of the things that hooked me was the various Morphic demos, like curved text, bouncing atoms, and the magnifier morph, all with the source code right there to learn from.  Jeff Pierce's wonderful port of Alice did the same thing for 3D.
>>>> 
>>>> We're at the beginning of a new era in computing, where a $1000 laptop has a CPU with 4 cores and a GPU with dozens.  What will be the new demos that catch the imagination of teenage Squeakers that are growing up with such computers?
>>>> 
>>>> The most obvious idea is to integrate Yoshiki's Kedama with OpenCL.  Conceptually, this seems to be a perfect fit, and I think that it would be a lot of fun.  Anybody interested in working on this with me?  Yoshiki?
>>> 
>>> Ah, cool.  Incidentally, I am working on an array processing object
>>> model and language that is supposedly a bit more generalized than
>>> Kedama, and someday I want to hook that up with GPUs.
>> 
>> 
>> Great!  I just downloaded your dissertation today... is it still authoritative, or are there some aspects of the system that are covered better in subsequent documents?
> 
>  It matches the implementation fairly well.  The most relevant
> part in this context is the plugin code, but that is not explained
> anywhere.


At least the expert is available. :-)


> 
>>> As for flexibility, also one of Kedama's points as well, would be to
>>> be able to dynamically modify the behavior of particles at runtime.  I
>>> haven't done my homework yet, but what would be the strategy for doing
>>> dynamic code change?
>> 
>> 
>> In current GPGPU architectures, execution is most efficient when
>> items in the same "work-group" follow the same code path.  For
>> example, say that you have particles representing ants that have 10
>> possible different behaviors specified by an integer from 1-10 (and
>> for simplicity, say that each of these behaviors takes 1000 clock
>> cycles to run).  Further, let's say that you naively write this as a
>> switch-statement in the OpenCL code... a different code-path is
>> dynamically chosen depending on the behavior-index for that ant.
>> Current architectures are inefficient in the case where ants in the
>> same work-group take different branches through the code.  If all
>> ants have the same behavior, it will take 1000 clock cycles.  If the
>> ants use 2 of the possible 10 behaviors, it will take 2000 clock
>> cycles.  In the worst case (ants use all 10 behaviors), it will
>> take 10000 clock cycles.
> 
>  Right.  I have gone through some CUDA documents, and this part
> appears to be the same.


Yes, it's built into the hardware.
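
To make the example above concrete, the naive kernel might look
something like this (a sketch; the two behavior functions are
made-up stand-ins for the real ten):

    /* Hypothetical behaviors, stand-ins for the real ones. */
    float2 wander(float2 p)      { return p + (float2)(0.1f, 0.0f); }
    float2 followScent(float2 p) { return p + (float2)(0.0f, 0.1f); }

    __kernel void stepAnts(__global float2 *pos,
                           __global const int *behavior)
    {
        int i = get_global_id(0);

        /* Each work-item picks its code path at runtime; work-items
           in the same work-group that take different cases are
           serialized by the hardware. */
        switch (behavior[i]) {
        case 1:  pos[i] = wander(pos[i]);       break;
        case 2:  pos[i] = followScent(pos[i]);  break;
        /* ... cases 3 through 10 ... */
        }
    }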


> 
>> The GPU can execute multiple work-groups at the same time
>> (approximately 16 today).  So, if you have some way of grouping ants
>> with the same current behavior into the same work-group, then you
>> can improve efficiency greatly compared to assigning them randomly
>> to work-groups.  Of course, this assignment will have overhead.
>> 
>> The above assumes that all behaviors are already known.  You're
>> probably also interested in code-generation.  To do this, you could
>> synthesize a String containing the new source-code that you want to
>> use, and upload the compiled code before running the next iteration
>> of the simulation.  There's currently no way to generate binary
>> code.  There's no fundamental technical reason for this, but OpenCL
>> is immature at this point, and it will be years before the vendors
>> can agree upon a suitable format.
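
(To make that last paragraph concrete, the host-side sequence is
roughly the following; error handling is omitted, and context and
device are assumed to be already set up.  The variable
generatedSource is hypothetical.)

    /* Compile freshly generated kernel source at runtime. */
    const char *src = generatedSource;   /* synthesized String */
    cl_int err;
    cl_program program =
        clCreateProgramWithSource(context, 1, &src, NULL, &err);
    clBuildProgram(program, 1, &device, NULL, NULL, NULL);
    cl_kernel kernel = clCreateKernel(program, "stepAnts", &err);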
> 
>  Some form of code generation, or just having a fixed set of
> primitives and calling them from some "interpreted code", would be
> a workable strategy.  Most of the deviating behavior is a kind of
> selective write-back: a line of expression is executed for all
> turtles, but the final assignment is masked by a boolean vector.


That sounds very doable.
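
If I understand correctly, each such primitive would look something
like this in OpenCL C (a sketch; the buffer names are made up):

    __kernel void maskedForward(__global float2 *pos,
                                __global const float2 *velocity,
                                __global const int *mask)  /* boolean vector */
    {
        int i = get_global_id(0);

        /* The expression is evaluated for every turtle... */
        float2 newPos = pos[i] + velocity[i];

        /* ...but only the selected turtles write the result back. */
        if (mask[i])
            pos[i] = newPos;
    }

Since the only divergence is the predicated store at the end, this
should run at essentially full speed, and new combinations of
primitives could always be compiled at runtime as described above.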


> 
>  There was automatic sequentialization when multiple turtles
> potentially write into the same variable in one "step".  This was
> needed semantically.  I wonder whether this automatic (and somewhat
> eager) serial execution is good for a GPU implementation or not.


Nope, it isn't.  Why is it necessary for the semantics?  Which part of your dissertation describes this?
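
For what it's worth, the usual GPU idiom for conflicting writes is
an atomic read-modify-write, which resolves the collisions in
hardware but in no defined order.  A sketch for a count-per-patch
update (the buffer names are made up):

    #pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable

    /* patchOfTurtle[i] says which patch turtle i is on (hypothetical). */
    __kernel void tallyPatches(__global const int *patchOfTurtle,
                               __global int *counts)
    {
        int i = get_global_id(0);

        /* Many turtles may hit the same patch; the additions are
           serialized by the hardware, in an unspecified order. */
        atom_add(&counts[patchOfTurtle[i]], 1);
    }

That works for commutative updates like a count or a sum, but it
can't reproduce a deterministic serial order, hence my question.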

Cheers,
Josh



> 
> -- Yoshiki



