[squeak-dev] Re: KedamaGPU, etc. (was: "OpenCL bindings for Squeak")
Josh Gargus
josh at schwa.ca
Sun Feb 21 21:31:00 UTC 2010
On Feb 21, 2010, at 11:52 AM, Yoshiki Ohshima wrote:
> At Sat, 20 Feb 2010 16:54:19 -0800,
> Josh Gargus wrote:
>>
>> On Feb 20, 2010, at 2:48 PM, Yoshiki Ohshima wrote:
>>
>>> At Sat, 20 Feb 2010 14:03:37 -0800,
>>> Josh Gargus wrote:
>>>>
>>>> While I was hacking away on my OpenCL bindings, I was thinking about what kind of small, fun demos I could include. When I was first exposed to Squeak, one of the things that hooked me was the set of Morphic demos, like curved text, bouncing-atoms, and the magnifier morph, all with the source code right there to learn from. Jeff Pierce's wonderful port of Alice did the same thing for 3D.
>>>>
>>>> We're at the beginning of a new era in computing, where a $1000 laptop has a CPU with 4 cores and a GPU with dozens. What will be the new demos that catch the imagination of teenage Squeakers that are growing up with such computers?
>>>>
>>>> The most obvious idea is to integrate Yoshiki's Kedama with OpenCL. Conceptually, this seems to be a perfect fit, and I think that it would be a lot of fun. Anybody interested in working on this with me? Yoshiki?
>>>
>>> Ah, cool. Incidentally, I am working on an array processing object
>>> model and language that is supposedly a bit more generalized than
>>> Kedama, and someday I want to hook that up with GPUs.
>>
>>
>> Great! I just downloaded your dissertation today... is it still authoritative, or are there some aspects of the system that are covered better in subsequent documents?
>
> It matches the implementation fairly well. The most relevant
> part in this context is the plugin code, though it is not
> explained anywhere.
At least the expert is available. :-)
>
>>> As for flexibility, also one of Kedama's points as well, would be to
>>> be able to dynamically modify the behavior of particles at runtime. I
>>> haven't done my homework yet, but what would be the strategy for doing
>>> dynamic code change?
>>
>>
>> In current GPGPU architectures, execution is most efficient when
>> items in the same "work group" follow the same code path. For
>> example, say that you have particles representing ants that have 10
>> possible different behaviors specified by an integer from 1-10 (and
>> for simplicity, say that each of these behaviors takes 1000 clock
>> cycles to run). Further, let's say that you naively write this as a
>> switch-statement in the OpenCL code... a different code-path is
>> dynamically chosen depending on the behavior-index for that ant.
>> Current architectures are inefficient in the case where ants in the
>> same work-group take different branches through the code. If all
>> ants have the same behavior, it will take 1000 clock cycles. If the
>> ants use 2 of the possible 10 behaviors, it will take 2000 clock
>> cycles. In the worst case (ants use all 10 behaviors) it will
>> take 10000 clock cycles.
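That cost model can be sketched in a few lines of C. This is only an illustration of the arithmetic above, not OpenCL API code; the function name, the 1000-cycle figure, and the use of 0-based behavior indices (0-9 rather than 1-10) are assumptions for the example:

```c
#include <assert.h>
#include <stddef.h>

#define CYCLES_PER_BEHAVIOR 1000
#define NUM_BEHAVIORS 10

/* A work-group pays for each distinct branch taken by any of its
   ants, because the hardware serializes divergent code paths. */
int workgroup_cycles(const int *behavior, size_t n) {
    int seen[NUM_BEHAVIORS] = {0};
    int distinct = 0;
    for (size_t i = 0; i < n; i++) {
        if (!seen[behavior[i]]) {
            seen[behavior[i]] = 1;
            distinct++;
        }
    }
    return distinct * CYCLES_PER_BEHAVIOR;
}
```

A uniform work-group costs 1000 cycles, one using two behaviors costs 2000, and one using all ten costs 10000, matching the figures above.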
>
> Right. I had gone through some CUDA documents and this part appears
> the same.
Yes, it's built into the hardware.
>
>> The GPU can execute multiple work-groups at the same time
>> (approximately 16 today). So, if you have some way of grouping ants
>> with the same current behavior into the same work-group, then you
>> can improve efficiency greatly compared to assigning them randomly
>> to work-groups. Of course, this assignment will have overhead.
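One way to do that grouping on the host is a counting sort on the behavior index, so ants sharing a behavior end up in contiguous ranges that can be dispatched as coherent work-groups. A minimal sketch, with illustrative names and an assumed fixed behavior count:

```c
#include <assert.h>
#include <stddef.h>

#define NUM_BEHAVIORS 10

/* Fill 'order' with ant indices permuted so that all ants with
   behavior 0 come first, then behavior 1, and so on. */
void group_by_behavior(const int *behavior, size_t n, int *order) {
    size_t count[NUM_BEHAVIORS] = {0};
    size_t start[NUM_BEHAVIORS];

    for (size_t i = 0; i < n; i++)           /* histogram of behaviors */
        count[behavior[i]]++;

    size_t acc = 0;                           /* prefix sums: where each */
    for (int b = 0; b < NUM_BEHAVIORS; b++) { /* behavior's range begins */
        start[b] = acc;
        acc += count[b];
    }

    for (size_t i = 0; i < n; i++)            /* scatter ant indices */
        order[start[behavior[i]]++] = (int)i;
}
```

This is O(n) per step, which is the overhead referred to above; whether it pays off depends on how divergent the population actually is.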
>>
>> The above assumes that all behaviors are already known. You're
>> probably also interested in code-generation. To do this, you could
>> synthesize a String containing the new source-code that you want to
>> use, and upload the compiled code before running the next iteration
>> of the simulation. There's currently no way to generate binary
>> code. There's no fundamental technical reason for this, but OpenCL
>> is immature at this point, and it will be years before the vendors
>> can agree upon a suitable format.
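The source-synthesis step might look like the sketch below. In a real host program the generated string would then be handed to clCreateProgramWithSource() and clBuildProgram() before the next iteration; here we only build and inspect the text, and the kernel name and body are made up for the example:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Synthesize OpenCL C source as a string at runtime, splicing a
   freshly generated statement into a fixed kernel skeleton. */
int make_kernel_source(char *buf, size_t cap, const char *body) {
    return snprintf(buf, cap,
        "__kernel void step(__global float *pos) {\n"
        "    size_t i = get_global_id(0);\n"
        "    %s\n"
        "}\n", body);
}
```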
>
> Some form of code generation, but with just a fixed set of
> primitives called from some "interpreted code", would be a
> workable strategy. Most of the deviated behaviors are a kind of
> selective write-back: a line of expression is executed for all turtles
> but the final assignment is masked by a boolean vector.
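That selective write-back pattern, in a scalar C sketch (the function and variable names are illustrative, not Kedama's):

```c
#include <assert.h>
#include <stddef.h>

/* Evaluate the expression for every turtle, but commit the result
   only where the boolean mask is set. */
void masked_add(float *x, const float *delta,
                const unsigned char *mask, size_t n) {
    for (size_t i = 0; i < n; i++) {
        float result = x[i] + delta[i]; /* computed for all turtles */
        if (mask[i])
            x[i] = result;              /* written back selectively */
    }
}
```

On a GPU this maps well to predicated execution: every work-item runs the same code, so there is no divergence penalty.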
That sounds very doable.
>
> There was automatic sequentialization when multiple turtles are
> potentially writing into the same variable in one "step". This was
> needed semantically. I wonder if this automatic (and somewhat eager)
> serial execution is good for a GPU implementation or not.
Nope, it isn't. Why is it necessary for the semantics? Which part of your dissertation describes this?
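For context, a read-modify-write on a shared variable is the kind of step that seems to force this serialization. This sketch is my reading of the issue, not code from the dissertation: executing the writes one turtle at a time gives a well-defined result, whereas unsynchronized concurrent writes on a GPU would race:

```c
#include <assert.h>
#include <stddef.h>

/* Each turtle performs shared = shared + contribution[i]. Running
   the turtles serially makes the accumulated result deterministic;
   naive parallel execution could lose updates. */
float sequential_step(const float *contribution, size_t n) {
    float shared = 0.0f;
    for (size_t i = 0; i < n; i++)
        shared = shared + contribution[i]; /* serialized read-modify-write */
    return shared;
}
```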
Cheers,
Josh
>
> -- Yoshiki