# Computational physics & GPU programming: interacting many-body simulation with OpenCL

In the second example of my series on GPU programming for scientists, I discuss a short OpenCL program, which you can compile and run on the CPU and the GPUs of various vendors. This gives me the opportunity to perform some cross-platform benchmarks for a classical plasma simulation. You can expect dramatic (several 100 fold) speed-ups on GPUs for this type of system. This is one of the reasons why molecular dynamics code can gain quite a lot by incorporating the massively parallel-programming paradigm in the algorithmic foundations.

The Open Computing Language (OpenCL) is relatively similar to its CUDA pendant, in practice the setup of an OpenCL kernel requires some housekeeping work, which might make the code look a bit more involved. I have based my interacting electrons calculation of transport in the Hall effect on an OpenCL code. Another examples is An OpenCL implementation for the solution of the time-dependent Schrödinger equation on GPUs and CPUs (arxiv version) by C. Ó Broin and L.A.A. Nikolopoulos.

Now to the coding of a two-dimensional plasma simulation, which is inspired by Laughlin’s mapping of a many-body wave function to an interacting classical ersatz dynamics (for some context see my short review Interacting electrons in a magnetic field: mapping quantum mechanics to a classical ersatz-system on the arxiv).

First, I assume that you have a working OpenCL installation, either for a GPU (NVIDIA or AMD), or for the CPU (provided for download by Intel, AMD, IBM, or  other CPU vendors).

For OpenCL the basic steps are very similar to the previously discussed CUDA example.

1. Define arrays which hold the positions and velcoties of all the particles contained in the simulation. Here it pays off to work with vector strcutures of 4 components, since GPUs are optimized to handle 4-tuples of information (three colors RGB and transparency A). For the two-dimensional electrons, I do not merge the 2 positions and 2 velocities in 1 vector, rather I use the extra dimensions to store for example the charge of the particle in the 4th component, or possible a particle index.
2. For convenience I initialize the system on the CPU and transfer the data to the GPU.
3. The GPU code for OpenCL is loaded and compiled only at run-time of the program (this is different compared to CUDA). This means you have to watch out for errors and warnings once you launch the main program.
4. In this simple example, just Newton’s equations are integrated using a very basic Euler integrator. There is lots of room for improvement, for example by switching to other integrators. Every 100 steps, I transfer the state of the electron gas (positions and momenta) back to the CPU. Here, just for output. In the much more advanced transport incarnation of the code, at this point it is decided if new electrons are injected and/or electrons are removed according to rules at the open-system boundary.
5. Rinse and repeat for the next time-step.

For convenience I wrap all the required memory pointers in a small class:

```class PropagateNBodySystem
{
private:
Context      *context;
CommandQueue *queue;
Kernel       *kernel_eom;
Program      *program_eom;

Buffer *gposold; // compute device: electron positions
Buffer *gvelold; // compute device: electron velocities
Buffer *gposnew; // compute device: store temp electron positions
Buffer *gvelnew; // compute device: store temp electron velocities

float *hposold;
float *hvelold;
float *hposnew;
float *hvelnew;
public:
PropagateNBodySystem(void);
void Initialize();
void InitializeOpenCL(bool talky);
void PropagateDoubleStep(float DeltaTime);
void PutElectrons(int NumBodies, float *hposnew, float *hvelnew );
void GetElectrons(int NumBodies, float *hposnew, float *hvelnew );
void WriteState(const char* filename);
void RunSimulation( void );
};
```

The Initialize member functions are reproduced below in the complete example. The InitializeOpenCL(); function loads the so-called kernel from a file called “integrate_eom_kernel.cl”

```/* integrate_eom_kernel.cl */
__kernel void integrate_eom(
__global float4* oldPosition,
__global float4* oldVelocity,
__global float4* newPosition,
__global float4* newVelocity,
__local float4* localPos,
int numElectrons,
float deltaTime
)
{
unsigned int tid = get_local_id(0);
unsigned int gid = get_global_id(0);
unsigned int localSize = get_local_size(0);

// Number of tiles we need to iterate
// all numTiles must be multiples of localsize to ensure an integer division without reminder
unsigned int numTilesElectrons = numElectrons / localSize;

// pick work-item for which we will calculate the new position
float4 pos = oldPosition[gid];
float4 vel = oldVelocity[gid];

float m=1.0;

// initilize acceleration to 0
float2 acc     = (float2)(0.0,0.0);
float2 acc_aux = (float2)(0.0,0.0);

// printf("tid=%d gid=%d pos=(%+e,%+e,%+e,%+e)\n",tid,gid,pos.x,pos.y,pos.z,pos.w);

// first calculate the electron-electron repulsion
for(int i=0;i<numTilesElectrons;++i)
{
// load one tile into local memory
int idx = i * localSize + tid;
localPos[tid] = oldPosition[idx];

// synchronize to make sure data is available for processing
barrier(CLK_LOCAL_MEM_FENCE);

// calculate mutual acceleration
acc_aux.x=0.0;
acc_aux.y=0.0;
for(int j = 0; j < localSize; ++j)
{
// calculate acceleration caused by particle j on particle i
float4 r = localPos[j] - pos;
float distSqr = r.x * r.x  +  r.y * r.y;

distSqr += 1.0e-9;
// Pauli force repulsion
float s=m*m*localPos[j].w*pos.w/distSqr;
{
acc_aux.x +=-s*r.x;
acc_aux.y +=-s*r.y;
}
}
// Synchronize so that next tile can be loaded
acc.x+=acc_aux.x;
acc.y+=acc_aux.y;
barrier(CLK_LOCAL_MEM_FENCE);
}

// second calculate the forces exerted by the positive background charges
{
acc.x+=-0.5*m*pos.x;
acc.y+=-0.5*m*pos.y;
}
// simplistic integration
{
vel.x += acc.x * deltaTime;
vel.y += acc.y * deltaTime;

pos.x += vel.x * deltaTime;
pos.y += vel.y * deltaTime;
}

newPosition[gid] = pos;
newVelocity[gid] = vel;
}
```

With some software-development toolkits ships a nice n-body simulation code sample and the strategy for optimizing the parallel calculation of all the mutual two-body forces is well known, see for instance the excellent chapter 31 Fast N-Body Simulations with CUDA in GPU Gems vol. 3.

Now we can put everything together, put this time you need to save the code in several files. I suggest to name the following main program “plasma_disk.cpp”

```/* plasma_disk.cpp
*/
#define __NO_STD_VECTOR // Use cl::vector instead of STL version
#define __CL_ENABLE_EXCEPTIONS

#include <CL/cl.hpp>
#include <utility>
#include <iostream>
#include <fstream>
#include <string>
#include <cstdio>
#include <cmath>
#include <cstdlib>
#include "opencl_error.h"

int NUMPART=256;
using namespace cl;

#ifdef GPU
int GROUP_SIZE=256;
#else
int GROUP_SIZE=1;
#endif

class PropagateNBodySystem
{
private:
Context      *context;
CommandQueue *queue;
Kernel       *kernel_eom;
Program      *program_eom;

Buffer *gposold; // compute device: electron positions
Buffer *gvelold; // compute device: electron velocities
Buffer *gposnew; // compute device: store temp electron positions
Buffer *gvelnew; // compute device: store temp electron velocities

float *hposold;
float *hvelold;
float *hposnew;
float *hvelnew;
public:
PropagateNBodySystem(void);
void Initialize();
void InitializeOpenCL(bool talky);
void PropagateDoubleStep(float DeltaTime);
void PutElectrons(int NumBodies, float *hposnew, float *hvelnew );
void GetElectrons(int NumBodies, float *hposnew, float *hvelnew );
void WriteState(const char* filename);
void RunSimulation( void );
};

PropagateNBodySystem::PropagateNBodySystem(void)
{
//
}

void PropagateNBodySystem::Initialize()
{
hposold  =new float[4*NUMPART];
hvelold  =new float[4*NUMPART];
hposnew  =new float[4*NUMPART];
hvelnew  =new float[4*NUMPART];

srandom(0);

for(int ip=0;ip<NUMPART;ip++)
{
// generate two random variables
float rand_zerotoone=float(rand())/float(RAND_MAX);
float rand_zeroto2pi=float(rand())/float(RAND_MAX)*M_PI*2.0;

hposold[ip*4+0]=sqrt(2.0*NUMPART*rand_zerotoone)*cos(rand_zeroto2pi);
hposold[ip*4+1]=sqrt(2.0*NUMPART*rand_zerotoone)*sin(rand_zeroto2pi);
hposold[ip*4+2]=0.0f;
hposold[ip*4+3]=1.0f;
}
}

void PropagateNBodySystem::InitializeOpenCL( bool talky )
{
try
{
// Get available platforms
vector<Platform> platforms;
Platform::get(&platforms);
// Select the default platform and create a context using this platform and the GPU
cl_context_properties cps[3] = { CL_CONTEXT_PLATFORM, (cl_context_properties)(platforms[0])(), 0 };
#ifdef GPU
context=new Context( CL_DEVICE_TYPE_GPU, cps);
#else
context=new Context( CL_DEVICE_TYPE_CPU, cps);
#endif
// Get a list of devices on this platform
vector<Device> devices = context->getInfo<CL_CONTEXT_DEVICES>();
// Print out information about the device
if (talky)
{
#include "print_info.h"
}
// Create a command queue and use the first device
queue=new CommandQueue(*context,devices[0]);

std::ifstream sourceFile_eom("integrate_eom_kernel.cl");
std::string sourceCode_eom(std::istreambuf_iterator<char>(sourceFile_eom),(std::istreambuf_iterator<char>()));
Program::Sources source_eom(1, std::make_pair(sourceCode_eom.c_str(), sourceCode_eom.length()+1));
// Make program of the source code in the context
program_eom = new Program(*context, source_eom);
// Build program for these specific devices, compiler argument to include local directory for header files (.h)
program_eom->build(devices,"-I.");
// Make kernel
kernel_eom=new Kernel(*program_eom, "integrate_eom");
fprintf(stderr,"integrate_eom done\n");

// Create memory buffers on GPU and populate them with the initial data
gposold  = new Buffer(*context, CL_MEM_READ_WRITE, 4*NUMPART * sizeof(float));
gvelold  = new Buffer(*context, CL_MEM_READ_WRITE, 4*NUMPART * sizeof(float));
gposnew  = new Buffer(*context, CL_MEM_READ_WRITE, 4*NUMPART * sizeof(float));
gvelnew  = new Buffer(*context, CL_MEM_READ_WRITE, 4*NUMPART * sizeof(float));
queue->enqueueWriteBuffer(*gposold, CL_TRUE, 0, 4*NUMPART * sizeof(float), hposold);
queue->enqueueWriteBuffer(*gvelold, CL_TRUE, 0, 4*NUMPART * sizeof(float), hvelold);
}
catch(Error error)
{
std::cerr << error.what() << "(" << oclErrorString(error.err()) << ")" << std::endl;
exit(1);
}
}

void PropagateNBodySystem::PropagateDoubleStep( float DeltaTime )
{
try
{
// Set arguments to kernel
kernel_eom->setArg( 0, *gposold);
kernel_eom->setArg( 1, *gvelold);
kernel_eom->setArg( 2, *gposnew);
kernel_eom->setArg( 3, *gvelnew);
kernel_eom->setArg( 4, __local(GROUP_SIZE*4*sizeof(float)));
kernel_eom->setArg( 5, NUMPART);
kernel_eom->setArg( 6, DeltaTime);

// Run the kernel on specific ND range
NDRange global(NUMPART);
NDRange local(GROUP_SIZE);

Event eventA;
queue->enqueueNDRangeKernel(*kernel_eom, NullRange, global, local,0,&eventA);
eventA.wait();

// set pointers to particles to kernel in reverse order to copy new->old

kernel_eom->setArg(0, *gposnew);
kernel_eom->setArg(1, *gvelnew);
kernel_eom->setArg(2, *gposold);
kernel_eom->setArg(3, *gvelold);

Event eventB;
queue->enqueueNDRangeKernel(*kernel_eom, NullRange, global, local,0,&eventB);
eventB.wait();
}
catch(Error error)
{
std::cerr << error.what() << "(" << error.err() << ")" << std::endl;
exit(1);
}
}

void PropagateNBodySystem::WriteState(const char* filename)
{
// transfer memory back to CPU
FILE *fd=fopen(filename,"w");
for(int i=0;i<NUMPART;i++)
{
fprintf(fd,"%e %e %e %e"   ,hposold[i*4+0],hposold[i*4+1],hposold[i*4+2],hposold[i*4+3]);
fprintf(fd," %e %e %e %e\n",hvelold[i*4+0],hvelold[i*4+1],hvelold[i*4+2],hvelold[i*4+3]);
}
fclose(fd);
}

void PropagateNBodySystem::GetElectrons(int NumBodies, float *pos, float *vel )
{
// transfer memory from GPU to CPU
queue->finish();
}

void PropagateNBodySystem::PutElectrons(int NumBodies, float *pos, float *vel )
{
// transfer memory from CPU to GPU
queue->enqueueWriteBuffer(*gposold, CL_TRUE, 0, 4*NumBodies * sizeof(float), pos);
queue->enqueueWriteBuffer(*gvelold, CL_TRUE, 0, 4*NumBodies * sizeof(float), vel);
queue->finish();
}

void PropagateNBodySystem::RunSimulation( void )
{
float deltaTime=1.0e-3f;
int nt=1001;

time_t c0,c1;

Initialize();
InitializeOpenCL(true);
PutElectrons(NUMPART,hposold,hvelold);

time(&c0);
for(int it=0;it<nt;it++) // main propagation loop
{
if (it%100==0)
{
char filename[500];
#ifdef GPU
sprintf(filename,"gpu_state%05d.dat",it);
#else
sprintf(filename,"cpu_state%05d.dat",it);
#endif
WriteState(filename);
}
PropagateDoubleStep(deltaTime);
}
time(&c1);
#ifdef GPU
fprintf(stderr,"GPU propagation for %d particles over %d doublesteps took %.0f s\n",NUMPART,nt,difftime(c1,c0));
FILE *fd=fopen("gpu_stats.dat","a");
fprintf(fd,"%d %d %d %f\n",NUMPART,GROUP_SIZE,nt,difftime(c1,c0));
fclose(fd);
#else
fprintf(stderr,"CPU propagation for %d particles over %d doublesteps took %.0f s\n",NUMPART,nt,difftime(c1,c0));
FILE *fd=fopen("cpu_stats.dat","a");
fprintf(fd,"%d %d %d %f\n",NUMPART,GROUP_SIZE,nt,difftime(c1,c0));
fclose(fd);
#endif
}

int main( int argc, char **argv )
{
if (argc==3)
{
GROUP_SIZE=atoi(argv[1]);
NUMPART=GROUP_SIZE*atoi(argv[2]);
}

PropagateNBodySystem Plasma;
Plasma.RunSimulation();

return 0;
}
```

The header file “CL/cl.hpp” is assumed to be available for the C++ bindings of OpenCL. Two other header files are optional for debugging/information purposes:

```/* opencl_error.h */
const char* oclErrorString(cl_int error)
{
static const char* errorString[] = {
"CL_SUCCESS",
"CL_DEVICE_NOT_FOUND",
"CL_DEVICE_NOT_AVAILABLE",
"CL_COMPILER_NOT_AVAILABLE",
"CL_MEM_OBJECT_ALLOCATION_FAILURE",
"CL_OUT_OF_RESOURCES",
"CL_OUT_OF_HOST_MEMORY",
"CL_PROFILING_INFO_NOT_AVAILABLE",
"CL_MEM_COPY_OVERLAP",
"CL_IMAGE_FORMAT_MISMATCH",
"CL_IMAGE_FORMAT_NOT_SUPPORTED",
"CL_BUILD_PROGRAM_FAILURE",
"CL_MAP_FAILURE",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"CL_INVALID_VALUE",
"CL_INVALID_DEVICE_TYPE",
"CL_INVALID_PLATFORM",
"CL_INVALID_DEVICE",
"CL_INVALID_CONTEXT",
"CL_INVALID_QUEUE_PROPERTIES",
"CL_INVALID_COMMAND_QUEUE",
"CL_INVALID_HOST_PTR",
"CL_INVALID_MEM_OBJECT",
"CL_INVALID_IMAGE_FORMAT_DESCRIPTOR",
"CL_INVALID_IMAGE_SIZE",
"CL_INVALID_SAMPLER",
"CL_INVALID_BINARY",
"CL_INVALID_BUILD_OPTIONS",
"CL_INVALID_PROGRAM",
"CL_INVALID_PROGRAM_EXECUTABLE",
"CL_INVALID_KERNEL_NAME",
"CL_INVALID_KERNEL_DEFINITION",
"CL_INVALID_KERNEL",
"CL_INVALID_ARG_INDEX",
"CL_INVALID_ARG_VALUE",
"CL_INVALID_ARG_SIZE",
"CL_INVALID_KERNEL_ARGS",
"CL_INVALID_WORK_DIMENSION",
"CL_INVALID_WORK_GROUP_SIZE",
"CL_INVALID_WORK_ITEM_SIZE",
"CL_INVALID_GLOBAL_OFFSET",
"CL_INVALID_EVENT_WAIT_LIST",
"CL_INVALID_EVENT",
"CL_INVALID_OPERATION",
"CL_INVALID_GL_OBJECT",
"CL_INVALID_BUFFER_SIZE",
"CL_INVALID_MIP_LEVEL",
"CL_INVALID_GLOBAL_WORK_SIZE",
};
const int errorCount = sizeof(errorString) / sizeof(errorString[0]);
const int index = -error;
return (index >= 0 && index < errorCount) ? errorString[index] : "Unspecified Error";
}
```

and the diagnostics about the available OpenCL device, adapted partly from the AMD sample “CLInfo”

```/* print_info.h

uses parts of CLInfo.cpp contained in  the AMD stream sdk v2.3 with copyright notice:

Redistribution and use of this material is permitted under the following conditions:

Redistributions must retain the above copyright notice and all terms of this license.

In no event shall anyone redistributing or accessing or using this material
commence or participate in any arbitration or legal action relating to this
contributors. The foregoing shall survive any expiration or termination of
this license or any agreement or access or use related to this material.

ANY BREACH OF ANY TERM OF THIS LICENSE SHALL RESULT IN THE IMMEDIATE REVOCATION
OF ALL RIGHTS TO REDISTRIBUTE, ACCESS OR USE THIS MATERIAL.

THIS MATERIAL IS PROVIDED BY ADVANCED MICRO DEVICES, INC. AND ANY COPYRIGHT
HOLDERS AND CONTRIBUTORS "AS IS" IN ITS CURRENT CONDITION AND WITHOUT ANY
REPRESENTATIONS, GUARANTEE, OR WARRANTY OF ANY KIND OR IN ANY WAY RELATED TO
SUPPORT, INDEMNITY, ERROR FREE OR UNINTERRUPTED OPERA TION, OR THAT IT IS FREE
FROM DEFECTS OR VIRUSES.  ALL OBLIGATIONS ARE HEREBY DISCLAIMED - WHETHER
EXPRESS, IMPLIED, OR STATUTORY - INCLUDING, BUT NOT LIMITED TO, ANY IMPLIED
WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE,
ACCURACY, COMPLETENESS, OPERABILITY, QUALITY OF SERVICE, OR NON-INFRINGEMENT.
IN NO EVENT SHALL ADVANCED MICRO DEVICES, INC. OR ANY COPYRIGHT HOLDERS OR
CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, PUNITIVE,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT
OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, REVENUE, DATA, OR PROFITS; OR
BUSINESS INTERRUPTION) HOWEVER CAUSED OR BASED ON ANY THEORY OF LIABILITY
ARISING IN ANY WAY RELATED TO THIS MATERIAL, EVEN IF ADVISED OF THE POSSIBILITY
OF SUCH DAMAGE. THE ENTIRE AND AGGREGATE LIABILITY OF ADVANCED MICRO DEVICES,
INC. AND ANY COPYRIGHT HOLDERS AND CONTRIBUTORS SHALL NOT EXCEED TEN DOLLARS
(US \$10.00). ANYONE REDISTRIBUTING OR ACCESSING OR USING THIS MATERIAL ACCEPTS
THIS ALLOCATION OF RISK AND AGREES TO RELEASE ADVANCED MICRO DEVICES, INC. AND
ANY COPYRIGHT HOLDERS AND CONTRIBUTORS FROM ANY AND ALL LIABILITIES,
OBLIGATIONS, CLAIMS, OR DEMANDS IN EXCESS OF TEN DOLLARS (US \$10.00). THE
FOREGOING ARE ESSENTIAL TERMS OF THIS LICENSE AND, IF ANY OF THESE TERMS ARE
CONSTRUED AS UNENFORCEABLE, FAIL IN ESSENTIAL PURPOSE, OR BECOME VOID OR
CONTRIBUTORS FOR ANY REASON, THEN ALL RIGHTS TO REDISTRIBUTE, ACCESS OR USE
THIS MATERIAL SHALL TERMINATE IMMEDIATELY. MOREOVER, THE FOREGOING SHALL
SURVIVE ANY EXPIRATION OR TERMINATION OF THIS LICENSE OR ANY AGREEMENT OR
ACCESS OR USE RELATED TO THIS MATERIAL.
NOTICE IS HEREBY PROVIDED, AND BY REDISTRIBUTING OR ACCESSING OR USING THIS
MATERIAL SUCH NOTICE IS ACKNOWLEDGED, THAT THIS MATERIAL MAY BE SUBJECT TO
RESTRICTIONS UNDER THE LAWS AND REGULATIONS OF THE UNITED STATES OR OTHER
COUNTRIES, WHICH INCLUDE BUT ARE NOT LIMITED TO, U.S. EXPORT CONTROL LAWS SUCH
AS THE EXPORT ADMINISTRATION REGULATIONS AND NATIONAL SECURITY CONTROLS AS
DEFINED THEREUNDER, AS WELL AS STATE DEPARTMENT CONTROLS UNDER THE U.S.
MUNITIONS LIST. THIS MATERIAL MAY NOT BE USED, RELEASED, TRANSFERRED, IMPORTED,
EXPORTED AND/OR RE-EXPORTED IN ANY MANNER PROHIBITED UNDER ANY APPLICABLE LAWS,
INCLUDING U.S. EXPORT CONTROL LAWS REGARDING SPECIFICALLY DESIGNATED PERSONS,
COUNTRIES AND NATIONALS OF COUNTRIES SUBJECT TO NATIONAL SECURITY CONTROLS.
MOREOVER, THE FOREGOING SHALL SURVIVE ANY EXPIRATION OR TERMINATION OF ANY
LICENSE OR AGREEMENT OR ACCESS OR USE RELATED TO THIS MATERIAL.

NOTICE REGARDING THE U.S. GOVERNMENT AND DOD AGENCIES: This material is
provided with "RESTRICTED RIGHTS" and/or "LIMITED RIGHTS" as applicable to
computer software and technical data, respectively. Use, duplication,
distribution or disclosure by the U.S. Government and/or DOD agencies is
subject to the full extent of restrictions in all applicable regulations,
including those found at FAR52.227 and DFARS252.227 et seq. and any successor
regulations thereof. Use of this material by the U.S. Government and/or DOD
agencies is acknowledgment of the proprietary rights of any copyright holders
and contributors, including those of Advanced Micro Devices, Inc., as well as
the provisions of FAR52.227-14 through 23 regarding privately developed and/or
commercial computer software.

This license forms the entire agreement regarding the subject matter hereof and
supersedes all proposals and prior discussions and writings between the parties
with respect thereto. This license does not affect any ownership, rights, title,
or interest in, or relating to, this material. No terms of this license can be
modified or waived, and no breach of this license can be excused, unless done
so in a writing signed by all affected parties. Each term of this license is
separately enforceable. If any term of this license is determined to be or
becomes unenforceable or illegal, such term shall be reformed to the minimum
extent necessary in order for this license to remain in effect in accordance
with its terms as modified by such reformation. This license shall be governed
by and construed in accordance with the laws of the State of Texas without
regard to rules on conflicts of law of any state or jurisdiction or the United
Nations Convention on the International Sale of Goods. All disputes arising out
of this license shall be subject to the jurisdiction of the federal and state
courts in Austin, Texas, and all defenses are hereby waived concerning personal
jurisdiction and venue of these courts.
*/
// Iteratate over platforms
std::cout << "Number of platforms:\t\t\t\t "
<< platforms.size()
<< std::endl;
for (cl::vector<cl::Platform>::iterator i = platforms.begin(); i != platforms.end(); ++i)
{

std::cout << "  Platform Profile:\t\t\t\t "
<< (*i).getInfo<CL_PLATFORM_PROFILE>().c_str()
<< std::endl;
std::cout << "  Platform Version:\t\t\t\t "
<< (*i).getInfo<CL_PLATFORM_VERSION>().c_str()
<< std::endl;
std::cout << "  Platform Name:\t\t\t\t\t "
<< (*i).getInfo<CL_PLATFORM_NAME>().c_str()
<< std::endl;
std::cout << "  Platform Vendor:\t\t\t\t "
<< (*i).getInfo<CL_PLATFORM_VENDOR>().c_str() << std::endl;
if ((*i).getInfo<CL_PLATFORM_EXTENSIONS>().size() > 0)
{
std::cout << "  Platform Extensions:\t\t\t "
<< (*i).getInfo<CL_PLATFORM_EXTENSIONS>().c_str()
<< std::endl;
}
}
std::cout << std::endl << std:: endl;
// Now Iteratate over each platform and its devices
for (cl::vector<cl::Platform>::iterator p = platforms.begin(); p != platforms.end(); ++p)
{
std::cout << "  Platform Name:\t\t\t\t\t "
<< (*p).getInfo<CL_PLATFORM_NAME>().c_str()
<< std::endl;

cl::vector<cl::Device> devices;
(*p).getDevices(CL_DEVICE_TYPE_ALL, &devices);

std::cout << "Number of devices:\t\t\t\t " << devices.size() << std::endl;
for (cl::vector<cl::Device>::iterator i = devices.begin(); i != devices.end(); ++i)
{
/* Get device name */
std::string deviceName = (*i).getInfo<CL_DEVICE_NAME>();
cl_device_type dtype = (*i).getInfo<CL_DEVICE_TYPE>();

/* Get driver version in int */
std::string driverVersion = (*i).getInfo<CL_DRIVER_VERSION>();
std::string calVersion(driverVersion.c_str());
calVersion = calVersion.substr(calVersion.find_last_of(".") + 1);
//int version = atoi(calVersion.c_str());
std::cout << "  Device Type:\t\t\t\t\t " ;
switch (dtype)
{
case CL_DEVICE_TYPE_ACCELERATOR:
std::cout << "CL_DEVICE_TYPE_ACCRLERATOR" << std::endl;
break;
case CL_DEVICE_TYPE_CPU:
std::cout << "CL_DEVICE_TYPE_CPU" << std::endl;
break;
case CL_DEVICE_TYPE_DEFAULT:
std::cout << "CL_DEVICE_TYPE_DEFAULT" << std::endl;
break;
case CL_DEVICE_TYPE_GPU:
std::cout << "CL_DEVICE_TYPE_GPU" << std::endl;
break;
}
std::cout << "  Device ID:\t\t\t\t\t "
<< (*i).getInfo<CL_DEVICE_VENDOR_ID>()
<< std::endl;
std::cout << "  Max compute units:\t\t\t\t "
<< (*i).getInfo<CL_DEVICE_MAX_COMPUTE_UNITS>()
<< std::endl;
std::cout << "  Max work items dimensions:\t\t\t "
<< (*i).getInfo<CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS>()
<< std::endl;
cl::vector< ::size_t> witems =
(*i).getInfo<CL_DEVICE_MAX_WORK_ITEM_SIZES>();
for (unsigned int x = 0; x < (*i).getInfo<CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS>(); x++)
{
std::cout << "    Max work items["
<< x << "]:\t\t\t\t "
<< witems[x]
<< std::endl;
}
std::cout << "  Max work group size:\t\t\t\t "
<< (*i).getInfo<CL_DEVICE_MAX_WORK_GROUP_SIZE>()
<< std::endl;
std::cout << "  Preferred vector width char:\t\t\t "
<< (*i).getInfo<CL_DEVICE_PREFERRED_VECTOR_WIDTH_CHAR>()
<< std::endl;
std::cout << "  Preferred vector width short:\t\t\t "
<< (*i).getInfo<CL_DEVICE_PREFERRED_VECTOR_WIDTH_SHORT>()
<< std::endl;
std::cout << "  Preferred vector width int:\t\t\t "
<< (*i).getInfo<CL_DEVICE_PREFERRED_VECTOR_WIDTH_INT>()
<< std::endl;
std::cout << "  Preferred vector width long:\t\t\t "
<< (*i).getInfo<CL_DEVICE_PREFERRED_VECTOR_WIDTH_LONG>()
<< std::endl;
std::cout << "  Preferred vector width float:\t\t\t "
<< (*i).getInfo<CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT>()
<< std::endl;
std::cout << "  Preferred vector width double:\t\t "
<< (*i).getInfo<CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE>()
<< std::endl;
std::cout << "  Max clock frequency:\t\t\t\t "
<< (*i).getInfo<CL_DEVICE_MAX_CLOCK_FREQUENCY>()
<< "Mhz"
<< std::endl;
std::cout << "  Address bits:\t\t\t\t\t "
<< std::endl;
std::cout << "  Max memory allocation:\t\t\t "
<< (*i).getInfo<CL_DEVICE_MAX_MEM_ALLOC_SIZE>()
<< std::endl;
std::cout << "  Image support:\t\t\t\t "
<< ((*i).getInfo<CL_DEVICE_IMAGE_SUPPORT>() ? "Yes" : "No")
<< std::endl;

if ((*i).getInfo<CL_DEVICE_IMAGE_SUPPORT>())
{
std::cout << "  Max number of images read arguments:\t "
<< std::endl;

std::cout << "  Max number of images write arguments:\t "
<< (*i).getInfo<CL_DEVICE_MAX_WRITE_IMAGE_ARGS>()
<< std::endl;

std::cout << "  Max image 2D width:\t\t\t "
<< (*i).getInfo<CL_DEVICE_IMAGE2D_MAX_WIDTH>()
<< std::endl;

std::cout << "  Max image 2D height:\t\t\t "
<< (*i).getInfo<CL_DEVICE_IMAGE2D_MAX_HEIGHT>()
<< std::endl;

std::cout << "  Max image 3D width:\t\t\t "
<< (*i).getInfo<CL_DEVICE_IMAGE3D_MAX_WIDTH>()
<< std::endl;

std::cout << "  Max image 3D height:\t "
<< (*i).getInfo<CL_DEVICE_IMAGE3D_MAX_HEIGHT>()
<< std::endl;

std::cout << "  Max image 3D depth:\t\t\t "
<< (*i).getInfo<CL_DEVICE_IMAGE3D_MAX_DEPTH>()
<< std::endl;

std::cout << "  Max samplers within kernel:\t\t "
<< (*i).getInfo<CL_DEVICE_MAX_SAMPLERS>()
<< std::endl;
}
std::cout << "  Max size of kernel argument:\t\t\t "
<< (*i).getInfo<CL_DEVICE_MAX_PARAMETER_SIZE>()
<< std::endl;

std::cout << "  Alignment (bits) of base address:\t\t "
<< std::endl;

std::cout << "  Minimum alignment (bytes) for any datatype:\t "
<< (*i).getInfo<CL_DEVICE_MIN_DATA_TYPE_ALIGN_SIZE>()
<< std::endl;

std::cout << "  Single precision floating point capability" << std::endl;
std::cout << "    Denorms:\t\t\t\t\t "
<< ((*i).getInfo<CL_DEVICE_SINGLE_FP_CONFIG>() &
CL_FP_DENORM ? "Yes" : "No")
<< std::endl;
std::cout << "    Quiet NaNs:\t\t\t\t\t "
<< ((*i).getInfo<CL_DEVICE_SINGLE_FP_CONFIG>() &
CL_FP_INF_NAN ? "Yes" : "No")
<< std::endl;
std::cout << "    Round to nearest even:\t\t\t "
<< ((*i).getInfo<CL_DEVICE_SINGLE_FP_CONFIG>() &
CL_FP_ROUND_TO_NEAREST ? "Yes" : "No")
<< std::endl;
std::cout << "    Round to zero:\t\t\t\t "
<< ((*i).getInfo<CL_DEVICE_SINGLE_FP_CONFIG>() &
CL_FP_ROUND_TO_ZERO ? "Yes" : "No")
<< std::endl;
std::cout << "    Round to +ve and infinity:\t\t\t "
<< ((*i).getInfo<CL_DEVICE_SINGLE_FP_CONFIG>() &
CL_FP_ROUND_TO_INF ? "Yes" : "No")
<< std::endl;
std::cout << "    IEEE754-2008 fused multiply-add:\t\t "
<< ((*i).getInfo<CL_DEVICE_SINGLE_FP_CONFIG>() &
CL_FP_FMA ? "Yes" : "No")
<< std::endl;

std::cout << "  Cache type:\t\t\t\t\t " ;
switch ((*i).getInfo<CL_DEVICE_GLOBAL_MEM_CACHE_TYPE>())
{
case CL_NONE:
std::cout << "None" << std::endl;
break;
std::cout << "Read only" << std::endl;
break;
break;
}

std::cout << "  Cache line size:\t\t\t\t "
<< (*i).getInfo<CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE>()
<< std::endl;

std::cout << "  Cache size:\t\t\t\t\t "
<< (*i).getInfo<CL_DEVICE_GLOBAL_MEM_CACHE_SIZE>()
<< std::endl;

std::cout << "  Global memory size:\t\t\t\t "
<< (*i).getInfo<CL_DEVICE_GLOBAL_MEM_SIZE>()
<< std::endl;

std::cout << "  Constant buffer size:\t\t\t\t "
<< (*i).getInfo<CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE>()
<< std::endl;

std::cout << "  Max number of constant args:\t\t\t "
<< (*i).getInfo<CL_DEVICE_MAX_CONSTANT_ARGS>()
<< std::endl;

std::cout << "  Local memory type:\t\t\t\t " ;
switch ((*i).getInfo<CL_DEVICE_LOCAL_MEM_TYPE>())
{
case CL_LOCAL:
break;
case CL_GLOBAL:
std::cout << "Global" << std::endl;
break;
}

std::cout << "  Local memory size:\t\t\t\t "
<< (*i).getInfo<CL_DEVICE_LOCAL_MEM_SIZE>()
<< std::endl;

std::cout << "  Profiling timer resolution:\t\t\t "
<< (*i).getInfo<CL_DEVICE_PROFILING_TIMER_RESOLUTION>()
<< std::endl;

std::cout << "  Device endianess:\t\t\t\t "
<< ((*i).getInfo<CL_DEVICE_ENDIAN_LITTLE>() ? "Little" : "Big")
<< std::endl;

std::cout << "  Available:\t\t\t\t\t "
<< ((*i).getInfo<CL_DEVICE_AVAILABLE>() ? "Yes" : "No")
<< std::endl;

std::cout << "  Compiler available:\t\t\t\t "
<< ((*i).getInfo<CL_DEVICE_COMPILER_AVAILABLE>() ? "Yes" : "No")
<< std::endl;

std::cout << "  Execution capabilities:\t\t\t\t " << std::endl;
std::cout << "    Execute OpenCL kernels:\t\t\t "
<< ((*i).getInfo<CL_DEVICE_EXECUTION_CAPABILITIES>() &
CL_EXEC_KERNEL ? "Yes" : "No")
<< std::endl;
std::cout << "    Execute native function:\t\t\t "
<< ((*i).getInfo<CL_DEVICE_EXECUTION_CAPABILITIES>() &
CL_EXEC_NATIVE_KERNEL ? "Yes" : "No")
<< std::endl;

std::cout << "  Queue properties:\t\t\t\t " << std::endl;
std::cout << "    Out-of-Order:\t\t\t\t "
<< ((*i).getInfo<CL_DEVICE_QUEUE_PROPERTIES>() &
CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE ? "Yes" : "No")
<< std::endl;
std::cout << "    Profiling :\t\t\t\t\t "
<< ((*i).getInfo<CL_DEVICE_QUEUE_PROPERTIES>() &
CL_QUEUE_PROFILING_ENABLE ? "Yes" : "No")
<< std::endl;

std::cout << "  Platform ID:\t\t\t\t\t "
<< (*i).getInfo<CL_DEVICE_PLATFORM>()
<< std::endl;

std::cout << "  Name:\t\t\t\t\t\t "
<< (*i).getInfo<CL_DEVICE_NAME>().c_str()
<< std::endl;

std::cout << "  Vendor:\t\t\t\t\t "
<< (*i).getInfo<CL_DEVICE_VENDOR>().c_str()
<< std::endl;

std::cout << "  Driver version:\t\t\t\t "
<< (*i).getInfo<CL_DRIVER_VERSION>().c_str()
<< std::endl;

std::cout << "  Profile:\t\t\t\t\t "
<< (*i).getInfo<CL_DEVICE_PROFILE>().c_str()
<< std::endl;

std::cout << "  Version:\t\t\t\t\t "
<< (*i).getInfo<CL_DEVICE_VERSION>().c_str()
<< std::endl;

std::cout << "  Extensions:\t\t\t\t\t "
<< (*i).getInfo<CL_DEVICE_EXTENSIONS>().c_str()
<< std::endl;

}
std::cout << std::endl << std::endl;
}
```

If it is not already installed on your system, you can download cl.hpp from the Khronos OpenCL consortium page and place it along with a copy of cl.h (which you should already have) in a subfolder “CL” on your computer.
You should have in your test directory the files

1. plasma_disk.cpp
2. integrate_eom_kernel.cl
3. opencl_error.h
4. print_info.h

and populate the subdirectory CL by
“mkdir ./CL” then “cp /usr/include/CL/* ./CL” and then make sure that it contains also the “cl.hpp” file. Now you can compile a version of the program for GPUs (NVIDIA or AMD) with

• g++ -DGPU -o gpu_plasma plasma_disk.cpp -I. -L/PATH/TO/OPENCL/GPU/INSTALLATION -lOpenCL

Be careful if you have multiple versions (CPU and GPU versions of libOpenCL installed. I check via

• ldd gpu_plasma

which one the binary links to and then you have to make sure that the LD_LIBRARY_PATH is accordingly adjusted before you start the program. In addition, the /etc/OpenCL/vendor/ directory must contain the registration files for the OpenCL installation.

Running the program requires to pass two command line arguments, the first is the so-called workgroup size and the second one is the number of plasma particles in units of multiples of the workgroup size.
For example a reasonable command line with NVIDIA hardware reads

• ./gpu_plasma 256 16

which will run the simulation for 256*16=4096 particles.
You can also generate a CPU OpenCL version (attention: in contrast to the previous post, this really requires an existent CPU OpenCL impementation installed) via

• g++ -o cpu_plasma plasma_disk.cpp -I. -L/PATH/TO/OPENCL/CPU/INSTALLATION -lOpenCL

For the CPU OpenCL runs, you should choose a small workgroup size, for example

• /usr/bin/time -f “wall seconds %e” ./cpu_plasma 1 4096

will keep you CPU busy for a while. If you run “top” while the program is running, you can monitor the CPU usage. The Intel OpenCL implementation uses on my desktop system 3 cores out of the 4 available ones.

Now to the fun part: benchmark numbers (please feel free to email me yours or to submit a comment, which I can then to update this post):
UPDATE: It turns out that the run time is a bit tricky to measure. For the moment, I am resorting to the elapsed wall time of the complete program, which is obtained by

• /usr/bin/time -f “wall seconds %e” ./gpu_plasma 256 100

On an NVIDIA C2075 (NVIDIA OpenCL 1.0) card the program call “./gpu_plasma 256 100” (25600 particles!) takes 77 seconds. Cathal Ó Broin reports on an AMD/ATI 6970 system (1536 Stream Processors arranged into 24 compute units with the core clock at 880 MHz) run-times for “./gpu_plasma 256 100” call 85 seconds.

Finally, for comparison: “./cpu_plasma 1 4096” with the AMD Stream toolkit on an 8-core XEON Intel(R) X5450 @3.00GHz takes ca 56 wall-clock seconds using 7 CPU cores. If I get more data points, we can supplement this table. Disclaimer: this table does not provide a general purpose benchmark, it is specific for the discussed code and the drivers used in conjunction. On the same hardware, the same algorithm might run faster for example if coded with CUDA.

device program call elapsed wall time
NVIDIA C2075 (NV OpenCL 1.1 driver) ./gpu_plasma 256 100 125 seconds
NVIDIA C2070 (NV OpenCL 1.0 driver) ./gpu_plasma 256 100 77 seconds
AMD ATI 6970 ./gpu_plasma 256 100 85 seconds
XEON Intel(R) X5450 @3.00GHz (using 8 cores) ./cpu_plasma 1 25600 1830 seconds

For convenience some links to OpenCL implementations:

Also some links to discussions of more advanced OpenCL topics (some implementations are fast moving, thus information is easily outdated). The links are provided in the hope that the information may be of interest to others. I do not necessarily endorse the views expressed or the facts presented on the linked sites.