Computational physics on the smartphone GPU
I recently got hold of a Nexus 4 smartphone, which features a GPU (Qualcomm Adreno 320) and conveniently ships with already installed OpenCL library. With minimal changes I got the previously discussed many-body program code related to the fractional quantum Hall effect up and running. No unrooting of the phone is required to run the code example. Please use the following recipe at your own risk, I don’t accept any liabilities. Here is what I did:
- Download and unpack the Android SDK from google for cross-compilation (my host computer runs Mac OS X).
- Download and unpack the Android NDK from google to build minimal C/C++ programs without Java (no real app).
- Install the standalone toolchain from the Android NDK. I used the following command for my installation:
/home/tkramer/android-ndk-r8d/build/tools/make-standalone-toolchain.sh \ --install-dir=/home/tkramer/android-ndk-standalone
- Put the OpenCL programs and source code in an extra directory, as described in my previous post
- Change one line in the cl.hpp header: instead of including <GL/gl.h> change to <GLES/gl.h>. Note: I am using the “old” cl.hpp bindings 1.1, further changes might be required for the newer bindings, see for instance this helpful blog
- Transfer the OpenCL library from the phone to a subdirectory lib/ inside your source code. To do so append the path to your SDK tools and use the adb command:
export PATH=/home/tkramer/adt-bundle-mac-x86_64-20130219/sdk/platform-tools:$PATH adb pull /system/lib/libOpenCL.so
- Cross compile your program. I used the following script, please feel free to provide shorter versions. Adjust the include directories and library directories for your installation.
rm plasma_disk_gpu /home/tkramer/android-ndk-standalone/bin/arm-linux-androideabi-g++ -v -g \ -DCL_USE_DEPRECATED_OPENCL_1_1_APIS -DGPU \ -I. \ -I/home/tkramer/android-ndk-standalone/include/c++/4.6 \ -I/home/tkramer/android-ndk-r8d/platforms/android-5/arch-arm/usr/include \ -Llib \ -march=armv7-a -mfloat-abi=softfp -mfpu=neon \ -fpic -fsigned-char -fdata-sections -funwind-tables -fstack-protector \ -ffunction-sections -fdiagnostics-show-option -fPIC \ -fno-strict-aliasing -fno-omit-frame-pointer -fno-rtti \ -lOpenCL \ -o plasma_disk_gpu plasma_disk.cpp
- Copy the executable to the data dir of your phone to be able to run it. This can be done without rooting the phone with the nice SSHDroid App, which by defaults transfers to /data . Don’t forget to copy the kernel .cl files:
scp -P 2222 integrate_eom_kernel.cl root@192.168.0.NNN: scp -P 2222 plasma_disk_gpu root@192.168.0.NNN:
- ssh into your phone and run the GPU program:
ssh -p 2222 root@192.168.0.NNN ./plasma_disk_gpu 64 16
- Check the resulting data files. You can copy them for example to the Download path of the storage and use the gnuplot (droidplot App) to plot them.
A short note about runtimes. On the Nexus 4 device the program runs for about 12 seconds, on a MacBook Pro with NVIDIA GT650M it completes in 2 seconds (in the example above the equations of motion for 16*64=1024 interacting particles are integrated). For larger particle numbers the phone often locks up.
An alternative way to transfer files to the device is to connect via USB cable and to install the Android Terminal Emulator app. Next
cd /data/data/jackpal.androidterm mkdir gpu chmod 777 gpu
On the host computer use adb to transfer the compiled program and the .cl kernel and start a shell to run the kernel
adb push integrate_eom_kernel.cl /data/data/jackpal.androidterm/gpu/ adb push plasma_disk_gpu /data/data/jackpal.androidterm/gpu/
You can either run the program within the terminal emulator or use the adb shell
adb shell cd /data/data/jackpal.androidterm/gpu/ ./plasma_disk_gpu 64 16
Lets see in how many years todays desktop GPUs can be found in smartphones and which computational physics codes can be run!
Computational physics & GPU programming: exciton lab for light-harvesting complexes (GPU-HEOM) goes live on nanohub.org
Christoph Kreisbeck and I are happy to announce the public availability of the Exciton Dynamics Lab for Light-
Harvesting Complexes (GPU-HEOM) hosted on nanohub.org. You need to register a user account (its free), and then you are ready to use GPU-HEOM for the Frenkel exciton model of light harvesting complexes. In release 1.0 we support
- calculating population dynamics
- tracking coherences between two eigenstates
- obtaining absorption spectra
- two-dimensional echo spectra (including excited state absorption)
- … and all this for general vibronic spectral densities parametrized by shifted Lorentzians.
I will post some more entries here describing how to use the tool for understanding how the spectral density affects the lifetime of electronic coherences (see also this blog entry).
In the supporting document section you find details of the implemented method and the assumptions underlying the tool. We are appreciating your feedback for further improving the tool.
We are grateful for the support of Prof. Gerhard Klimeck, Purdue University, director of the Network for Computational Nanotechnology to bring GPU computing to nanohub (I believe our tool is the first GPU enabled one at nanohub).
If you want to refer to the tool you can cite it as:
Christoph Kreisbeck; Tobias Kramer (2013), “Exciton Dynamics Lab for Light-Harvesting Complexes (GPU-HEOM),” http://nanohub.org/resources/gpuheompop. (DOI: 10.4231/D3RB6W248).
and you find further references in the supporting documentation.
I very much encourage my colleagues developing computer programs for theoretical physics and chemistry to make them available on platforms such as nanohub.org. In my view, it greatly facilitates the comparison of different approaches and is the spirit of advancing science by sharing knowledge and providing reproducible data sets.
Good or bad vibrations for the Fenna-Matthews-Olson complex?

Time-evolution of the coherence for the FMO complex (eigenstates 1 and 5 ) calculated with GPU-HEOM by Kreisbeck and Kramer, J. Phys. Chem Lett. 3, 2828 (2012).
Due to its known structure and relative simplicity, the Fenna-Matthews-Olson complex of green sulfur bacteria provides an interesting test-case for our understanding of excitonic energy transfer in a light-harvesting complex.
The experimental pump-probe spectra (discussed in my previous post catching and tracking light: following the excitations in the Fenna-Matthews-Olson complex) show long-lasting oscillatory components and this finding has been a puzzle for theoretician and led to a refinement of the well-established models. These models show a reasonable agreement with the data and the rate equations explain the relaxation and transfer of excitonic energy to the reaction center.
However, the rate equations are based on estimates for the relaxation and dephasing rates. As Christoph Kreisbeck and I discuss in our article Long-Lived Electronic Coherence in Dissipative Exciton-Dynamics of Light-Harvesting Complexes (arxiv version), an exact calculation with GPU-HEOM following the best data for the Hamiltonian allows one to determine where the simple approach is insufficient and to identify a key-factor supporting electronic coherence:

Important features in the spectral density of the FMO complex related to the persistence of cross-peak oscillations in 2d spectra.
It’s the vibronic spectral density – redrawn (in a different unit convention, multiplied by ω2) from the article by M. Wendling from the group of Prof. Rienk van Grondelle. We did undertake a major effort to proceed in our calculations as close to the measured shape of the spectral density as the GPU-HEOM method allows one. By comparison of results for different forms of the spectral density, we identify how the different parts of the spectral density lead to distinct signatures in the oscillatory coherences. This is illustrated in the figure on the rhs. To get long lasting oscillations and finally to relax, three ingredients are important
- a small slope towards zero frequency, which suppresses the pure dephasing.
- a high plateau in the region where the exciton energy differences are well coupled. This leads to relaxation.
- the peaked structures induce a “very-long-lasting” oscillatory component, which is shown in the first figure. In our analysis we find that this is a persistent, but rather small (<0.01) modulation.
2d spectra are smart objects

FMO spectrum calculated with GPU-HEOM for a 3 peak approximation of the measured spectral density, including disorder averaging but no excited state absorption.
The calculation of 2d echo spectra requires considerable computational resources. Since theoretically calculated 2d spectra are needed to check how well theory and experiment coincide, I conclude with showing a typical spectrum we obtain (including static disorder, but no excited state absorption for this example). One interesting finding is that 2d spectra are able to differentiate between the different spectral densities. For example for a a single-peak Drude-Lorentz spectral density (sometimes chosen for computational convenience), the wrong peaks oscillate and the life-time of cross-peak oscillations is short (and becomes even shorter with longer vibronic memory). But this is for the experts only, see the supporting information of our article.
Are vibrations good or bad? Probably both… The pragmatic answer is that the FMO complex lives in an interesting parameter regime. The exact calculations within the Frenkel exciton model do confirm the well-known dissipative energy transfer picture. But on the other hand the specific spectral density of the FMO complex supports long-lived coherences (at least if the light source is a laser beam), which require considerable theoretical and experimental efforts to be described and measured. Whether the seen coherence has any biological relevance is an entirely different topic… maybe the green-sulfur bacteria are just enjoying a glimpse into Schrödinger’s world of probabilistic uncertainty.
Computational physics & GPU programming: interacting many-body simulation with OpenCL

Trajectories in a two-dimensional interacting plasma simulation, reproducing the density and pair-distribution function of a Laughlin state relevant for the quantum Hall effect. Figure taken from Interacting electrons in a magnetic field: mapping quantum mechanics to a classical ersatz-system.
In the second example of my series on GPU programming for scientists, I discuss a short OpenCL program, which you can compile and run on the CPU and the GPUs of various vendors. This gives me the opportunity to perform some cross-platform benchmarks for a classical plasma simulation. You can expect dramatic (several 100 fold) speed-ups on GPUs for this type of system. This is one of the reasons why molecular dynamics code can gain quite a lot by incorporating the massively parallel-programming paradigm in the algorithmic foundations.
The Open Computing Language (OpenCL) is relatively similar to its CUDA pendant, in practice the setup of an OpenCL kernel requires some housekeeping work, which might make the code look a bit more involved. I have based my interacting electrons calculation of transport in the Hall effect on an OpenCL code. Another examples is An OpenCL implementation for the solution of the time-dependent Schrödinger equation on GPUs and CPUs (arxiv version) by C. Ó Broin and L.A.A. Nikolopoulos.
Now to the coding of a two-dimensional plasma simulation, which is inspired by Laughlin’s mapping of a many-body wave function to an interacting classical ersatz dynamics (for some context see my short review Interacting electrons in a magnetic field: mapping quantum mechanics to a classical ersatz-system on the arxiv).
Computational physics & GPU programming: Solving the time-dependent Schrödinger equation
I start my series on the physics of GPU programming by a relatively simple example, which makes use of a mix of library calls and well-documented GPU kernels. The run-time of the split-step algorithm described here is about 280 seconds for the CPU version (Intel(R) Xeon(R) CPU E5420 @ 2.50GHz), vs. 10 seconds for the GPU version (NVIDIA(R) Tesla C1060 GPU), resulting in 28 fold speed-up! On a C2070 the run time is less than 5 seconds, yielding an 80 fold speedup.

Autocorrelation function C(t) of a Gaussian wavepacket in a uniform force field. I compare the GPU and CPU results using the wavepacket code.
The description of coherent electron transport in quasi two-dimensional electron gases requires to solve the Schrödinger equation in the presence of a potential landscape. As discussed in my post Time to find eigenvalues without diagonalization, our approach using wavepackets allows one to obtain the scattering matrix over a wide range of energies from a single wavepacket run without the need to diagonalize a matrix. In the following I discuss the basic example of propagating a wavepacket and obtaining the autocorrelation function, which in turn determines the spectrum. I programmed the GPU code in 2008 as a first test to evaluate the potential of GPGPU programming for my research. At that time double-precision floating support was lacking and the fast Fourier transform (FFT) implementations were little developed. Starting with CUDA 3.0, the program runs fine in double precision and my group used the algorithm for calculating electron flow through nanodevices. The CPU version was used for our articles in Physica Scripta Wave packet approach to transport in mesoscopic systems and the Physical Review B Phase shifts and phase π-jumps in four-terminal waveguide Aharonov-Bohm interferometers among others.
Here, I consider a very simple example, the propagation of a Gaussian wavepacket in a uniform potential , for which the autocorrelation function of the initial state
is known in analytic form:
.
Read the rest of this entry »
The physics of GPU programming

Me pointing at the GPU Resonance cluster at SEAS Harvard with 32x448=14336 processing cores. Just imagine how tightly integrated this setup is compared to 3584 quad-core computers. Picture courtesy of Academic Computing, SEAS Harvard.
From discussions I learn that while many physicists have heard of Graphics Processing Units as fast computers, resistance to use them is widespread. One of the reasons is that physics has been relying on computers for a long time and tons of old, well trusted codes are lying around which are not easily ported to the GPU. Interestingly, the adoption of GPUs happens much faster in biology, medical imaging, and engineering.
I view GPU computing as a great opportunity to investigate new physics and my feeling is that todays methods optimized for serial processors may need to be replaced by a different set of standard methods which scale better with massively parallel processors. In 2008 I dived into GPU programming for a couple of reasons:
- As a “model-builder” the GPU allows me to reconsider previous limitations and simplifications of models and use the GPU power to solve the extended models.
- The turn-around time is incredibly fast. Compared to queues in conventional clusters where I wait for days or weeks, I get back results with 10000 CPU hours compute time the very same day. This in turn further facilitates the model-building process.
- Some people complain about the strict synchronization requirements when running GPU codes. In my view this is an advantage, since essentially no messaging overhead exists.
- If you want to develop high-performance algorithm, it is not good enough to convert library calls to GPU library calls. You might get speed-ups of about 2-4. However, if you invest the time and develop your own know-how you can expect much higher speed-ups of around 100 times or more, as seen in the applications I discussed in this blog before.
This summer I will lecture about GPU programming at several places and thus I plan to write a series of GPU related posts. I do have a complementary background in mathematical physics and special functions, which I find very useful in relation with GPU programming since new physical models require a stringent mathematical foundation and numerical studies.
Catching and tracking light: following the excitations in the Fenna-Matthews-Olson complex

The animation shows how peaks in the 2d echo-spectra are oscillation and changing for various delay times. For a full explanation, see Modelling of Oscillations in Two-Dimensional Echo-Spectra of the Fenna-Matthews-Olson Complex by B.Hein, C. Kreisbeck, T. Kramer, M. Rodríguez, New J. of Phys., 14, 023018 (2012), open access.
Efficient and fast transport of electric current is a basic requirement for the functioning of nanodevices and biological systems. A neat example is the energy-transport of a light-induced excitation in the Fenna-Matthews-Olson complex of green sulfur bacteria. This process has been elucidated by pump-probe spectroscopy. The resulting spectra contain an enormous amount of information about the couplings of the different pigments and the pathways taken by the excitation. The basic guide to a 2d echo-spectrum is as follows:
You can find peaks of high intensity along the diagonal line which are roughly representing a more common absorption spectrum. If you delay the pump and probe pulses by several picoseconds, you will find a new set of peaks at a horizontal axis which indicates that energy of the excitation gets redistributed and the system relaxes and transfers part of the energy to vibrational motion. This process is nicely visible in the spectra recorded by Brixner et al.
A lot of excitement and activity on photosynthetic complexes was triggered by experiments of Engel et al showing that besides the relaxation process also periodic oscillations are visible in the oscillations for more than a picosecond.
What is causing the oscillations in the peak amplitudes of 2d echo-spectra in the Fenna-Matthews Olson complex?
A purely classical transport picture should not show such oscillations and the excitation instead hops around the complex without interference. Could the observed oscillations point to a different transport mechanism, possibly related to the quantum-mechanical simultaneous superposition of several transport paths?
The initial answer from the theoretical side was no, since within simplified models the thermalization occurs fast and without oscillations. It turned out that the simple calculations are a bit too simplistic to describe the system accurately and exact solutions are required. But exact solutions (even for simple models) are difficult to obtain. Known exact methods such as DMRG work only reliable at very low temperatures (-273 C), which are not directly applicable to biological systems. Other schemes use the famous path integrals but are too slow to calculate the pump-probe signals.
Our contribution to the field is to provide an exact computation of the 2d echo-spectra at the relevant temperatures and to see the difference to the simpler models in order to quantify how much coherence is preserved. From the method-development the computational challenge is to speed-up the calculations several hundred times in order to get results within days of computational run-time. We did achieve this by developing a method which we call GPU-hierarchical equations of motion (GPU-HEOM). The hierarchical equations of motions are a nice scheme to propagate a density matrix under consideration of non-Markovian effects and strong couplings to the environment. The HEOM scheme was developed by Kubo, Tanimura, and Ishizaki (Prof. Tanimura has posted some material on HEOM here).
However, the original computational method suffers from the same problems as path-integral calculations and is rather slow (though the HEOM method can be made faster and applied to electronic systems by using smart filtering as done by Prof. YiJing Yan). The GPU part in GPU-HEOM stands for Graphics Processing Units. Using our GPU adoption of the hierarchical equations (see details in Kreisbeck et al[JCTC, 7, 2166 (2011)] ) allowed us to cut down computational times dramatically and made it possible to perform a systematic study of the oscillations and the influence of temperature and disorder in our recent article Hein et al [New J. of Phys., 14, 023018 (2012), open access] .

