[squeak-dev] Prepare for Thousands of Cores --- oh my Chip - it's full of cores!

Joshua Gargus schwa at fastmail.us
Sun Jul 6 18:28:13 UTC 2008


On Jul 6, 2008, at 10:09 AM, Igor Stasenko wrote:
>
> From:
> http://en.wikipedia.org/wiki/CUDA
> ----
> Threads must run in groups of at least 32 threads that execute
> identical instructions simultaneously. Branches in the program code do
> not impact performance significantly, provided that each of 32 threads
> takes the same execution path; the SIMD execution model becomes a
> significant limitation for any inherently divergent task (e.g.,
> traversing a ray tracing acceleration data structure).
> ----
>
> Even though we can program the GPU, we can't make it run different
> code :(
> Also, there is something utterly wrong with this statement.
> Since it's a waste to run 32 threads on the same set of input data,
> it's obvious that the input is different. But if the input data is
> different, how is it possible for every thread to take the same path?

They don't have to take the same branch, but performance can suffer if  
they take different branches.

As a real-world example of different inputs taking the same branch,
consider cel shading (http://en.wikipedia.org/wiki/Cel_shading).
Each pixel is processed by a separate thread.  You might have a
bit of code like 'if (diffuse_component < threshold) then color =
shadow_color; else color = lit_color'.  You only need an 8x4 block of
pixels to fill 32 threads, and the majority of 32-pixel blocks do
execute the same path through the code.
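
To make that concrete, here is a rough Python sketch (ordinary CPU
code, not a shader) that counts how many 8x4 tiles of an image take a
single branch of the test above.  The image size, the threshold and
the diagonal brightness gradient in diffuse_component() are all
made up for illustration; only the 8x4 tile size and the branch
itself come from the example.

```python
# Toy model: count how many 8x4 pixel tiles take a single branch of the
# cel-shading test. All sizes and values here are invented for illustration.

WIDTH, HEIGHT = 64, 32      # hypothetical image size
TILE_W, TILE_H = 8, 4       # 8x4 pixels = 32 threads
THRESHOLD = 0.5             # hypothetical shading threshold


def diffuse_component(x, y):
    """Hypothetical lighting: brightness rises along the image diagonal."""
    return (x + y) / (WIDTH + HEIGHT - 2)


def takes_shadow_branch(x, y):
    """The branch from the snippet: True means color = shadow_color."""
    return diffuse_component(x, y) < THRESHOLD


def count_tiles():
    """Return (uniform, divergent) tile counts over the whole image."""
    uniform = divergent = 0
    for ty in range(0, HEIGHT, TILE_H):
        for tx in range(0, WIDTH, TILE_W):
            # Collect the set of branch outcomes seen inside this tile.
            branches = {
                takes_shadow_branch(x, y)
                for y in range(ty, ty + TILE_H)
                for x in range(tx, tx + TILE_W)
            }
            if len(branches) == 1:
                uniform += 1      # all 32 pixels agree
            else:
                divergent += 1    # the light/shadow edge crosses this tile
    return uniform, divergent


uniform, divergent = count_tiles()
print(uniform, "uniform tiles,", divergent, "divergent tiles")
```

Only the tiles that straddle the edge between light and shadow
diverge; every other tile has 32 different inputs that all take the
same path.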

I don't get to work on this sort of thing as much as I'd like, so I
can't be completely certain about the following.  But I believe that
the above code snippet wouldn't result in bad performance even if
some pixels within a block took one branch and some took the other,
because both branches are just a single cheap assignment.  As I
understand it, all of the threads in a block have to finish at the
same time, so they can start on the next chunk of input at the same
time.  The cost shows up when the branches differ a lot in length:
if you have 31 threads that take the fast path, and 1 thread that
branches into a longer computation, then the other 31 threads are
held up for the one.
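
A toy cost model of that lockstep behaviour, again as a plain Python
sketch.  It assumes that when the threads in a group diverge, each
distinct path is run once for the whole group while the threads not
on that path sit idle, which is how NVIDIA's documentation describes
warp divergence.  The function names and the cycle counts are
invented for illustration, not measurements.

```python
# Toy cost model for one group of 32 lockstep threads.
# Costs are made-up cycle counts, not measurements.

def lockstep_cost(path_per_thread, cost_of_path):
    """Cost of the group when diverging paths run one after another.

    Every distinct path taken by at least one thread is executed once
    for the whole group; threads not on that path idle while it runs.
    """
    return sum(cost_of_path[path] for path in set(path_per_thread))


def independent_cost(path_per_thread, cost_of_path):
    """Average cost per thread if each thread could run on its own."""
    total = sum(cost_of_path[path] for path in path_per_thread)
    return total / len(path_per_thread)


cheap = {"shadow": 1, "lit": 1}      # cel shading: both branches trivial
skewed = {"fast": 1, "slow": 100}    # one branch is a long computation

mixed_cel = ["shadow"] * 16 + ["lit"] * 16   # half the pixels in shadow
one_slow = ["fast"] * 31 + ["slow"]          # 31 fast threads, 1 slow

print("mixed cel-shading group:", lockstep_cost(mixed_cel, cheap))
print("31 fast + 1 slow, lockstep:", lockstep_cost(one_slow, skewed))
print("31 fast + 1 slow, independent:", independent_cost(one_slow, skewed))
```

Under this model a split cel-shading group costs barely more than a
uniform one, while a single slow thread makes the whole group pay
for the long path.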

Was that clear?

Cheers,
Josh

>
>
> -- 
> Best regards,
> Igor Stasenko AKA sig.
>



