Computational physics on GPUs: writing portable code

GPU-HEOM code comparison for various hardware.
Runtime in seconds for our GPU-HEOM code on various hardware and software platforms.

I am preparing my presentation for the simGPU meeting next week in Freudenstadt, Germany, and performed some benchmarks.
In the previous post I described how to get an OpenCL program running on a smartphone with GPU. By now Christoph Kreisbeck and I are getting ready to release our first smartphone GPU app for exciton dynamics in photosynthetic complexes, more about that in a future entry.
Getting the same OpenCL kernel running on laptop GPUs, workstation GPUs and CPUs, and smartphones/tablets is a bit tricky, due to different initialisation procedures and the differences in the optimal block sizes for the thread grid. In addition on a smartphone the local memory is even smaller than on a desktop GPU and double-precision floating point support is missing. The situation reminds me a bit of the “earlier days” of GPU programming in 2008.
Besides being a proof of concept, I see writing portable code as a sort of insurance with respect to further changes of hardware (however always with the goal to stick with the massively parallel programming paradigm). I am also amazed how fast smartphones are gaining computational power through GPUs!
Same comparison for smaller memory consumption. Note the drop in OpenCL performance for the NVIDIA K20c GPU.
Same comparison for smaller memory consumption. Note the drop in OpenCL performance for the NVIDIA K20c GPU.

Here some considerations and observations:

  1. Standard CUDA code can be ported to OpenCL within a reasonable time-frame. I found the following resources helpful:
    AMDs porting remarks
    Matt Scarpinos OpenCL blog
  2. The comparison of OpenCL vs CUDA performance for the same algorithm can reveal some surprises on NVIDIA GPUs. While on our C2050 GPU OpenCL works a bit faster for the same problem compared to the CUDA version, on a K20c system for certain problem sizes the OpenCL program can take several times longer than the CUDA code (no changes in the basic algorithm or workgroup sizes).
  3. The comparison with a CPU version running on 8 cores of the Intel Xeon machine is possible and shows clearly that the GPU code is always faster, but requires a certain minimal systems size to show its full performance.
  4. I am looking forward to running the same code on the Intel Xeon Phi systems now available with OpenCL drivers, see also this blog.

[Update June 22, 2013: I updated the graphs to show the 8-core results using Intels latest OpenCL SDK. This brings the CPU runtimes down by a factor of 2! Meanwhile I am eagerly awaiting the possibility to run the same code on the Xeon Phis…]

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.