I am preparing my presentation for the simGPU meeting next week in Freudenstadt, Germany, and performed some benchmarks.
In the previous post I described how to get an OpenCL program running on a smartphone with GPU. By now Christoph Kreisbeck and I are getting ready to release our first smartphone GPU app for exciton dynamics in photosynthetic complexes, more about that in a future entry.
Getting the same OpenCL kernel running on laptop GPUs, workstation GPUs and CPUs, and smartphones/tablets is a bit tricky, due to different initialisation procedures and the differences in the optimal block sizes for the thread grid. In addition on a smartphone the local memory is even smaller than on a desktop GPU and double-precision floating point support is missing. The situation reminds me a bit of the “earlier days” of GPU programming in 2008.
Besides being a proof of concept, I see writing portable code as a sort of insurance with respect to further changes of hardware (however always with the goal to stick with the massively parallel programming paradigm). I am also amazed how fast smartphones are gaining computational power through GPUs!
Here some considerations and observations:
Standard CUDA code can be ported to OpenCL within a reasonable time-frame. I found the following resources helpful:
AMDs porting remarks
Matt Scarpinos OpenCL blog
- The comparison of OpenCL vs CUDA performance for the same algorithm can reveal some surprises on NVIDIA GPUs. While on our C2050 GPU OpenCL works a bit faster for the same problem compared to the CUDA version, on a K20c system for certain problem sizes the OpenCL program can take several times longer than the CUDA code (no changes in the basic algorithm or workgroup sizes).
- The comparison with a CPU version running on 8 cores of the Intel Xeon machine is possible and shows clearly that the GPU code is always faster, but requires a certain minimal systems size to show its full performance.
- I am looking forward to running the same code on the Intel Xeon Phi systems now available with OpenCL drivers, see also this blog.
[Update June 22, 2013: I updated the graphs to show the 8-core results using Intels latest OpenCL SDK. This brings the CPU runtimes down by a factor of 2! Meanwhile I am eagerly awaiting the possibility to run the same code on the Xeon Phis…]