With increasing computational power of massively-parallel computers, a more accurate modeling of the energy-transfer dynamics in larger and more complex photosynthetic systems (=light-harvesting complexes) becomes feasible – provided we choose the right algorithms and tools.
The diverse character of hardware found in high-performance computers (hpc) seemingly requires to rewrite program code from scratch depending if we are targeting multi-core CPU systems, integrated many-core platforms (Xeon PHI/MIC), or graphics processing units (GPUs).
To avoid the defragmentation of our open quantum-system dynamics workhorse (see the previous GPU-HEOM posts) across the various hpc-platforms, we have transferred the GPU-HEOM CUDA code to the Open Compute Language (OpenCL). The resulting QMaster tool is described in our just published article Scalable high-performance algorithm for the simulation of exciton-dynamics. Application to the light harvesting complex II in the presence of resonant vibrational modes (collaboration of Christoph Kreisbeck, Tobias Kramer, Alan Aspuru-Guzik). This post details the computational challenges and lessons learnt, the application to the light-harvesting complex II found in spinach will be the topic of the next post.
In my experience, it is not uncommon to develop a nice GPU application for instance with CUDA, which later on is scaled up to handle bigger problem sizes. With increasing problem size also the memory demands increase and even the 12 GB provided by the Kepler K40 are finally exhausted. Upon reaching this point, two options are possible: (a) to distribute the memory across different GPU devices or (b) to switch to architectures which provide more device-memory. Option (a) requires substantial changes to existing program code to manage the distributed memory access, while option (b) in combination with OpenCL requires (in the best case) only to adapt the kernel-launch configuration to the different platforms.
QMaster implements an extension of the hierarchical equation of motion (HEOM) method originally proposed by Tanimura and Kubo, which involves many (small) matrix-matrix multiplications. For GPU applications, the usage of local memory and the optimal thread-grids for fast matrix-matrix multiplications have been described before and are used in QMaster (and the publicly available GPU-HEOM tool on nanohub.org). While for GPUs the best performance is achieved using shared/local memory and assign one thread to each matrix element, the multi-core CPU OpenCL variant performs better with fewer threads, but getting more work per thread done. Therefore we use for the CPU machines a thread-grid which computes one complete matrix product per thread (this is somewhat similar to following the “naive” approach given in NVIDIA’s OpenCL programming guide, chapter 2.5). This strategy did not work very well for the Xeon PHI/MIC OpenCL case, which requires additional data structure changes, as we learnt from discussions with the distributed algorithms and hpc experts in the group of Prof. Reinefeld at the Zuse-Institute in Berlin.
The good performance and scaling across the 64 CPU AMD opteron workstation positively surprised us and lays the groundwork to investigate the validity of approximations to the energy-transfer equations in the spinach light-harvesting system, the topic for the next post.