[squeak-dev] Re: KedamaGPU, etc. (was: "OpenCL bindings for Squeak")

Yoshiki Ohshima yoshiki at vpri.org
Sun Feb 21 19:52:26 UTC 2010


At Sat, 20 Feb 2010 16:54:19 -0800,
Josh Gargus wrote:
> 
> On Feb 20, 2010, at 2:48 PM, Yoshiki Ohshima wrote:
> 
> > At Sat, 20 Feb 2010 14:03:37 -0800,
> > Josh Gargus wrote:
> >> 
> >> While I was hacking away on my OpenCL bindings, I was thinking about what kind of small, fun demos I could include.  When I was first exposed to Squeak, one of the things that hooked me were the various Morphic demos, like curved text, bouncing-atoms, and the magnifier morph, all with the source code right there to learn from.  Jeff Pierce's wonderful port of Alice did the same thing for 3D.
> >> 
> >> We're at the beginning of a new era in computing, where a $1000 laptop has a CPU with 4 cores and a GPU with dozens.  What will be the new demos that catch the imagination of teenage Squeakers that are growing up with such computers?
> >> 
> >> The most obvious idea is to integrate Yoshiki's Kedama with OpenCL.  Conceptually, this seems to be a perfect fit, and I think that it would be a lot of fun.  Anybody interested in working on this with me?  Yoshiki?
> > 
> >  Ah, cool.  Incidentally, I am working on an array processing object
> > model and language that is supposedly a bit more generalized than
> > Kedama, and someday I want to hook that up with GPUs.
> 
> 
> Great!  I just downloaded your dissertation today... is it still authoritative, or are there some aspects of the system that are covered better in subsequent documents?

  It still matches the implementation fairly well.  The most relevant
part in this context is the plugin code, but that is not explained
anywhere.

> >  As for flexibility, which is also one of Kedama's strengths, it
> > would be good to be able to dynamically modify the behavior of
> > particles at runtime.  I haven't done my homework yet, but what
> > would be the strategy for doing dynamic code changes?
> 
> 
> In current GPGPU architectures, execution is most efficient when
> items in the same "work group" follow the same code path.   For
> example, say that you have particles representing ants that have 10
> possible different behaviors specified by an integer from 1-10 (and
> for simplicity, say that each of these behaviors takes 1000 clock
> cycles to run).  Further, let's say that you naively write this as a
> switch-statement in the OpenCL code... a different code-path is
> dynamically chosen depending on the behavior-index for that ant.
> Current architectures are inefficient in the case where ants in the
> same work-group take different branches through the code.  If all
> ants have the same behavior, it will take 1000 clock cycles.  If the
> ants use 2 of the possible 10 behaviors, it will take 2000 clock
> cycles.  In the worst case (ants use all 10 behaviors), it will
> take 10000 clock cycles.

  Right.  I have gone through some CUDA documents, and this part
appears to be the same there.
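
  To make the divergence case concrete, here is a minimal sketch of
the kind of kernel Josh describes, in OpenCL C; the names (stepAnts,
antBehavior, antState) are made up for illustration, and the 10
behaviors are assumed to be indexed 0..9 on the GPU side:

    /* Hypothetical kernel: one work-item per ant.  When ants in the
     * same work-group carry different behavior indices, the hardware
     * runs the taken branches one after another, so the group's cost
     * is roughly the sum of the distinct branches taken. */
    __kernel void stepAnts(__global const int *antBehavior,
                           __global float *antState)
    {
        int i = get_global_id(0);
        switch (antBehavior[i]) {
            case 0:  antState[i] += 1.0f;  break;  /* behavior 0   */
            case 1:  antState[i] *= 0.5f;  break;  /* behavior 1   */
            /* ... cases 2-9 for the remaining behaviors ...        */
            default:                       break;
        }
    }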

> The GPU can execute multiple work-groups at the same time
> (approximately 16 today).  So, if you have some way of grouping ants
> with the same current behavior into the same work-group, then you
> can improve efficiency greatly compared to assigning them randomly
> to work-groups.  Of course, this assignment will have overhead.
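
  One way to do that assignment on the host side is a counting sort
over the behavior indices each step; a sketch in plain C (names
invented, behavior indices again assumed to be 0..9, and this is only
one possible strategy):

    /* Build a permutation that places ants with the same behavior
     * index next to each other, so that each work-group sees
     * (mostly) one behavior.  O(n) per simulation step. */
    void groupByBehavior(const int *behavior, int *order, int numAnts)
    {
        int count[10] = {0}, start[10], b, i;
        for (i = 0; i < numAnts; i++) count[behavior[i]]++;
        start[0] = 0;
        for (b = 1; b < 10; b++) start[b] = start[b - 1] + count[b - 1];
        for (i = 0; i < numAnts; i++) order[start[behavior[i]]++] = i;
        /* 'order' is then used to permute (or indirectly index) the
         * ant arrays before the kernel is enqueued. */
    }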
>
> The above assumes that all behaviors are already known.  You're
> probably also interested in code-generation.  To do this, you could
> synthesize a String containing the new source-code that you want to
> use, and upload the compiled code before running the next iteration
> of the simulation.  There's currently no way to generate binary
> code.  There's no fundamental technical reason for this, but OpenCL
> is immature at this point, and it will be years before the vendors
> can agree upon a suitable format.
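
  For reference, the runtime-compilation path looks roughly like this
with the OpenCL host API (a sketch only; the function name and the
"stepAnts" kernel name are invented, and error handling is omitted):

    #include <CL/cl.h>

    /* Compile freshly generated kernel source before the next
     * iteration.  'generatedKernelSource' would be the string
     * synthesized in Squeak. */
    cl_kernel buildStepKernel(cl_context ctx, cl_device_id dev,
                              const char *generatedKernelSource)
    {
        cl_int err;
        cl_program prog = clCreateProgramWithSource(ctx, 1,
                              &generatedKernelSource, NULL, &err);
        clBuildProgram(prog, 1, &dev, "", NULL, NULL);
        /* The caller sets arguments with clSetKernelArg and enqueues
         * with clEnqueueNDRangeKernel for the next simulation step. */
        return clCreateKernel(prog, "stepAnts", &err);
    }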

  Some form of code generation, or just having a fixed set of
primitives and calling them from some "interpreted code", would be a
workable strategy.  Most of the deviating behaviors are a kind of
selective write-back: an expression is evaluated for all turtles, but
the final assignment is masked by a boolean vector.
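
  In OpenCL C that pattern might look roughly like this (a sketch
with invented names, not Kedama's actual generated code):

    /* The expression is evaluated for every turtle; a per-turtle
     * mask decides whether the result is actually stored. */
    __kernel void maskedStep(__global const int *mask,
                             __global const float *x,
                             __global float *heading)
    {
        int i = get_global_id(0);
        float newValue = x[i] * 2.0f;                 /* all turtles  */
        heading[i] = mask[i] ? newValue : heading[i]; /* masked store */
    }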

  There is automatic sequentialization when multiple turtles
potentially write into the same variable in one "step"; this is
needed semantically.  I wonder whether this automatic (and somewhat
eager) serial execution is a good fit for a GPU implementation or not.
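
  If that eagerness turns out to be a poor fit, one alternative worth
measuring would be hardware atomics, which serialize only the actual
collisions; a sketch with invented names (this needs the
cl_khr_global_int32_base_atomics extension in OpenCL 1.0, while
atomic_add is built in from OpenCL 1.1):

    #pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable

    /* Many turtles may increment the same patch cell in one step;
     * the atomic lets the hardware serialize only the collisions. */
    __kernel void depositPheromone(__global const int *patchIndex,
                                   __global int *patchCount)
    {
        int i = get_global_id(0);
        atom_add(&patchCount[patchIndex[i]], 1);
    }

  Atomics give no particular ordering among the colliding writes,
though, so whether that preserves the semantics that the automatic
sequentialization provides is exactly the open question.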

-- Yoshiki


