How a wave packet travels through a quantum electronic interferometer

Together with Christoph Kreisbeck and Rafael A Molina I have contributed a blog entry to the News and Views section of the Journal of Physics describing our most recent work on an Aharonov-Bohm interferometer with an embedded quantum dot (article, arxiv). Can you spot Schrödinger’s cat in the result?

Transition between the resistivity of the nanoring with and without an embedded quantum dot. The vertical axis denotes the Fermi energy (controlled by a gate), while the horizontal axis scans through the magnetic field to induce phase differences between the pathways.

Dusting off cometary surfaces: collimated jets despite a homogeneous emission pattern.

Effective Gravitational potential of the comet (including the centrifugal contribution), the maximal value of the potential (red) is about 0.46 N/m, the minimal value (blue) 0.31 N/m computed with the methods described in this post. The rotation period is taken to be 12.4043 h. Image computed with the OpenCL cosim code. Image (C) Tobias Kramer (CC-BY SA 3.0 IGO).

Knowledge of GPGPU techniques is helpful for rapid model building and testing of scientific ideas. For example, the beautiful pictures taken by the ESA/Rosetta spacecraft of comet 67P/Churyumov–Gerasimenko reveal jets of dust particles emitted from the comet. Wouldn’t it be nice to have a fast method to simulate thousands of dust particles around the comet and to find out whether the peculiar shape of this space potato alone already influences the dust trajectories through its gravitational potential? At the Zuse-Institut in Berlin the distributed algorithm and visual data analysis groups joined forces to test this idea. But first an accurate shape model of the comet 67P C-G is required. As published in his blog, Mattias Malmer has done amazing work to extract a shape model from the published navigation camera images.

  1. Starting from the shape model by Mattias Malmer, we obtain a re-meshed model with fewer triangles on the surface (we use about 20,000 triangles). The key property of the new mesh is a homogeneous coverage of the cometary surface with almost equally sized triangles. We don’t want better resolution and adaptive mesh sizes in areas with more complex features. Rather, we consider a homogeneous emission pattern without isolated activity regions, which is best modeled by mesh cells of equal area. Will this prescription nevertheless yield collimated dust jets? We’ll see…
  2. To compute the gravitational potential of such a surface we follow this nice article by JT Conway. The calculation later on stays in the rotating frame anchored to the comet, thus in addition the centrifugal and Coriolis forces need to be included.
  3. To accelerate the method, OpenCL comes to the rescue and lets one compute many trajectories in parallel. What is required are physical conditions for the starting positions of the dust as it flies off the surface. We put one dust-particle on the center of each triangle on the surface and set the initial velocity along the normal direction to typically 2 or 4 m/s. This ensures that most particles are able to escape and not fall back on the comet.
  4. To visualize the resulting point clouds of dust particles we have programmed an OpenGL visualization tool. We compute the rotation and sunlight direction on the comet to cast shadows and add activity profiles to the comet surface to mask out dust originating from the dark side of the comet.
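
Steps 2 and 3 can be sketched in a few lines of Python. This is only a minimal illustration under loud assumptions, not the OpenCL code mentioned in the post: the polyhedral gravity of the shape model is replaced by a point mass, the value of `GM` is a rough assumed figure for 67P, the rotation axis is taken along z, and the starting position is hypothetical. What the sketch does show is how the centrifugal and Coriolis terms enter the equations of motion in the rotating frame, and that the Jacobi integral can serve as a conservation check.

```python
import math

GM = 667.0  # m^3/s^2, rough assumed value for 67P (point-mass stand-in)
OMEGA = 2.0 * math.pi / (12.4043 * 3600.0)  # rad/s, rotation about the z axis


def acceleration(pos, vel):
    """Acceleration in the co-rotating frame: gravity + centrifugal + Coriolis."""
    x, y, z = pos
    vx, vy, _ = vel
    r3 = (x * x + y * y + z * z) ** 1.5
    ax = -GM * x / r3 + OMEGA ** 2 * x + 2.0 * OMEGA * vy
    ay = -GM * y / r3 + OMEGA ** 2 * y - 2.0 * OMEGA * vx
    az = -GM * z / r3
    return (ax, ay, az)


def _shift(a, b, h):
    return tuple(ai + h * bi for ai, bi in zip(a, b))


def rk4_step(pos, vel, dt):
    """One classical Runge-Kutta step for the state (pos, vel)."""
    k1p, k1v = vel, acceleration(pos, vel)
    p2, v2 = _shift(pos, k1p, dt / 2), _shift(vel, k1v, dt / 2)
    k2p, k2v = v2, acceleration(p2, v2)
    p3, v3 = _shift(pos, k2p, dt / 2), _shift(vel, k2v, dt / 2)
    k3p, k3v = v3, acceleration(p3, v3)
    p4, v4 = _shift(pos, k3p, dt), _shift(vel, k3v, dt)
    k4p, k4v = v4, acceleration(p4, v4)
    new_pos = tuple(p + dt / 6.0 * (a + 2 * b + 2 * c + d)
                    for p, a, b, c, d in zip(pos, k1p, k2p, k3p, k4p))
    new_vel = tuple(v + dt / 6.0 * (a + 2 * b + 2 * c + d)
                    for v, a, b, c, d in zip(vel, k1v, k2v, k3v, k4v))
    return new_pos, new_vel


def jacobi_integral(pos, vel):
    """Conserved quantity in the rotating frame (per unit mass)."""
    x, y, z = pos
    kinetic = 0.5 * sum(v * v for v in vel)
    return (kinetic - GM / math.sqrt(x * x + y * y + z * z)
            - 0.5 * OMEGA ** 2 * (x * x + y * y))


def trace_particle(pos, normal, speed=2.0, dt=1.0, steps=3600):
    """Launch one dust grain along the surface normal and integrate its path."""
    vel = tuple(speed * n for n in normal)
    for _ in range(steps):
        pos, vel = rk4_step(pos, vel, dt)
    return pos, vel


if __name__ == "__main__":
    start = (2000.0, 0.0, 0.0)  # hypothetical surface point, in metres
    end, _ = trace_particle(start, (1.0, 0.0, 0.0))
    print("distance after one hour [m]:", math.sqrt(sum(c * c for c in end)))
```

Each particle is independent of all others, which is why one trajectory per work-item maps so naturally onto OpenCL.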

This is what we get for May 3, 2015. The ESA/NAVCAM image is taken verbatim from the Rosetta/blog.

Comparison of the homogeneous dust model (left panel) with ESA/NAVCAM Rosetta images. (C) Left panel: Tobias Kramer and Matthias Noack 2015. Right panel: (C) ESA/NAVCAM team CC BY-SA 3.0 IGO, link see text.

Read more about the physics and results in our arxiv article T. Kramer et al.: Homogeneous Dust Emission and Jet Structure near Active Cometary Nuclei: The Case of 67P/Churyumov-Gerasimenko (submitted for publication) and grab the code to compute your own dust trajectories with OpenCL at

Slow or fast transfer: bottleneck states in light-harvesting complexes

Light-harvesting complex II, crystal structure 1RWT from Liu et al (Nature 2004, vol. 428, p. 287), rendered with VMD. The labels denote the designation of the chlorophyll sites (601-614). Chlorophylls 601,605-609 are of chlorophyll b type, the others of type a.

In the previous post I described some of the computational challenges for modeling energy transfer in the light harvesting complex II (LHCII) found in spinach. Here, I discuss the results we have obtained for the dynamics and choreography of excitonic energy transfer through the chlorophyll network. Compared to the Fenna-Matthews-Olson complex, LHCII has twice as many chlorophylls per monomeric unit (labeled 601-614 with chlorophyll a and b types).
Previous studies of exciton dynamics had to stick to simple exponential decay models based on either Redfield or Förster theory for describing the transfer from the Chl b to the Chl a sites. The results are neither satisfying nor conclusive, since the transfer time differs widely depending on the method chosen (tens of picoseconds vs picoseconds!).

Exciton dynamics in LHCII computed with various methods. HEOM denotes the most accurate method, while Redfield and Förster approximations fail.

Resolving the discrepancies between the various approximate methods requires a more accurate approach. With the accelerated HEOM at hand, we revisited the problem and calculated the transfer rates. We find slower rates than given by the Redfield expressions. A combined Förster-Redfield description is possible in hindsight by using HEOM to identify a suitable cut-off parameter (Mcr=30/cm in this specific case).

Since the energy transfer is driven by the coupling of electronic degrees of freedom to vibrational ones, it is important to assess how the vibrational mode distribution affects the transfer. In particular, it has been proposed that specifically tuned vibrational modes might promote a fast relaxation. We find no strong impact of such modes on the transfer. Rather, we see (independent of the detailed vibrational structure) several bottleneck states, which act as a transient reservoir for the exciton flux. The details and distribution of the bottleneck states depend strongly on the parameters of the electronic couplings and differ for the two most commonly discussed LHCII models proposed by Novoderezhkin/Marin/van Grondelle and Müh/Madjet/Renger – both are considered in the article Scalable high-performance algorithm for the simulation of exciton-dynamics. Application to the light harvesting complex II in the presence of resonant vibrational modes (collaboration of Christoph Kreisbeck, Tobias Kramer, Alan Aspuru-Guzik).
Again, the correct assignment of the bottleneck states requires using HEOM and looking beyond the approximate rate equations.
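
To make the notion of a transient reservoir concrete, here is a deliberately simple toy model. It is not HEOM and not one of the LHCII parametrizations; it is a three-state kinetic chain (source, bottleneck, sink) with hypothetical rates `k_in` and `k_out` chosen only for illustration. When the outflow is much slower than the inflow, population piles up in the intermediate state before it drains.

```python
def propagate_chain(k_in=1.0, k_out=0.1, dt=0.001, t_end=20.0):
    """Euler integration of source -> bottleneck -> sink (rates in 1/ps).

    The rates are hypothetical illustration values, not fitted parameters.
    Returns a list of (time, source, bottleneck, sink) tuples.
    """
    source, bottleneck, sink = 1.0, 0.0, 0.0
    history = [(0.0, source, bottleneck, sink)]
    steps = int(round(t_end / dt))
    for n in range(1, steps + 1):
        flow_in = k_in * source * dt        # population entering the bottleneck
        flow_out = k_out * bottleneck * dt  # population leaving towards the sink
        source -= flow_in
        bottleneck += flow_in - flow_out
        sink += flow_out
        history.append((n * dt, source, bottleneck, sink))
    return history


if __name__ == "__main__":
    history = propagate_chain()
    peak = max(history, key=lambda row: row[2])
    print("bottleneck population peaks at t = %.2f ps" % peak[0])
```

For this chain the peak population of the intermediate state is known analytically, (k_out/k_in)^(k_out/(k_in-k_out)), so the numerical result can be checked against a closed formula. Whether a given chlorophyll really plays this role in LHCII is exactly what the approximate rate equations cannot decide.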

High-performance OpenCL code for modeling energy transfer in spinach

With increasing computational power of massively-parallel computers, a more accurate modeling of the energy-transfer dynamics in larger and more complex photosynthetic systems (=light-harvesting complexes) becomes feasible – provided we choose the right algorithms and tools.

OpenCL cross-platform performance for tracking energy transfer in the light-harvesting complex II found in spinach, see Fig. 1 in the article. Smaller values indicate higher performance. The program code was originally written for massively-parallel GPUs, but also performs well on the AMD Opteron setup. The Intel MIC OpenCL variant does not reach the peak performance (a different data layout seems to be required to benefit from autovectorization).

The diverse character of hardware found in high-performance computers (hpc) seemingly requires rewriting program code from scratch, depending on whether we target multi-core CPU systems, integrated many-core platforms (Xeon PHI/MIC), or graphics processing units (GPUs).

To avoid the fragmentation of our open quantum-system dynamics workhorse (see the previous GPU-HEOM posts) across the various hpc-platforms, we have transferred the GPU-HEOM CUDA code to the Open Computing Language (OpenCL). The resulting QMaster tool is described in our just published article Scalable high-performance algorithm for the simulation of exciton-dynamics. Application to the light harvesting complex II in the presence of resonant vibrational modes (collaboration of Christoph Kreisbeck, Tobias Kramer, Alan Aspuru-Guzik). This post details the computational challenges and lessons learnt; the application to the light-harvesting complex II found in spinach will be the topic of the next post.

In my experience, it is not uncommon to develop a nice GPU application for instance with CUDA, which later on is scaled up to handle bigger problem sizes. With increasing problem size also the memory demands increase and even the 12 GB provided by the Kepler K40 are finally exhausted. Upon reaching this point, two options are possible: (a) to distribute the memory across different GPU devices or (b) to switch to architectures which provide more device-memory. Option (a) requires substantial changes to existing program code to manage the distributed memory access, while option (b) in combination with OpenCL requires (in the best case) only to adapt the kernel-launch configuration to the different platforms.

The OpenCL device fission extension allows us to investigate the scaling of the QMaster code with the number of CPU cores. We observe a linear scaling up to 48 cores.

QMaster implements an extension of the hierarchical equation of motion (HEOM) method originally proposed by Tanimura and Kubo, which involves many (small) matrix-matrix multiplications. For GPU applications, the usage of local memory and the optimal thread-grids for fast matrix-matrix multiplications have been described before and are used in QMaster (and in the publicly available GPU-HEOM tool). While for GPUs the best performance is achieved by using shared/local memory and assigning one thread to each matrix element, the multi-core CPU OpenCL variant performs better with fewer threads that each do more work. Therefore we use for the CPU machines a thread-grid which computes one complete matrix product per thread (this is somewhat similar to following the “naive” approach given in NVIDIA’s OpenCL programming guide, chapter 2.5). This strategy did not work very well for the Xeon PHI/MIC OpenCL case, which requires additional data structure changes, as we learnt from discussions with the distributed algorithms and hpc experts in the group of Prof. Reinefeld at the Zuse-Institute in Berlin.
The good performance and scaling across the 64 CPU AMD Opteron workstation positively surprised us and lays the groundwork to investigate the validity of approximations to the energy-transfer equations in the spinach light-harvesting system, the topic for the next post.
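
The difference between the two thread-grids can be illustrated with a serial Python emulation. This is a sketch only and not QMaster’s OpenCL code: the names `kernel_per_element`, `kernel_per_matrix`, and `launch` are hypothetical, and the loop inside `launch` merely stands in for an OpenCL NDRange of work-items. Local memory and tiling, which matter on a real GPU, are left out.

```python
def kernel_per_element(gid, a, b, c, n):
    """GPU-style work-item: computes ONE element of ONE matrix product."""
    batch, rest = divmod(gid, n * n)
    row, col = divmod(rest, n)
    c[batch][row][col] = sum(a[batch][row][k] * b[batch][k][col]
                             for k in range(n))


def kernel_per_matrix(gid, a, b, c, n):
    """CPU-style work-item: computes one COMPLETE matrix product."""
    for row in range(n):
        for col in range(n):
            c[gid][row][col] = sum(a[gid][row][k] * b[gid][k][col]
                                   for k in range(n))


def launch(kernel, global_size, *args):
    """Serial stand-in for launching `global_size` independent work-items."""
    for gid in range(global_size):
        kernel(gid, *args)


def batched_products(a, b, per_element):
    """Multiply a batch of small n x n matrices with either thread-grid."""
    batch, n = len(a), len(a[0])
    c = [[[0.0] * n for _ in range(n)] for _ in range(batch)]
    if per_element:
        launch(kernel_per_element, batch * n * n, a, b, c, n)  # many tiny items
    else:
        launch(kernel_per_matrix, batch, a, b, c, n)           # few heavy items
    return c


if __name__ == "__main__":
    a = [[[1, 2], [3, 4]], [[0, 1], [1, 0]]]
    b = [[[5, 6], [7, 8]], [[2, 3], [4, 5]]]
    print(batched_products(a, b, per_element=True))
    print(batched_products(a, b, per_element=False))
```

Both variants produce identical results. Only the kernel-launch configuration changes (`batch * n * n` work-items versus `batch` work-items), which is the kind of per-platform adaptation discussed above.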

Tutorial #1: simulate 2d spectra of light-harvesting complexes with GPU-HEOM @ nanoHub

The computation and prediction of two-dimensional (2d) echo spectra of photosynthetic complexes is a daunting task and requires enormous computational resources – if done without drastic simplifications. However, such computations are absolutely required to test and validate our understanding of energy transfer in photosynthesis. You can find some background material in the recently published lecture notes on Modelling excitonic-energy transfer in light-harvesting complexes (arxiv version) of the Latin American School of Physics Marcos Moshinsky.
The ability to compute 2d spectra of photosynthetic complexes without resorting to strong approximations is to my knowledge an exclusive privilege of the Hierarchical Equations of Motion (HEOM) method due to its superior performance on massively-parallel graphics processing units (GPUs). You can find some background material on the GPU performance in the two conference talks Christoph Kreisbeck and I presented at the GTC 2014 conference (recorded talk, slides) and the first nanoHub users meeting.

GPU-HEOM 2d spectra computed at nanohub: computed 2d spectra for the FMO complex for 0 picosecond delay time (upper panel) and 1 ps (lower panel). The GPU-HEOM computation takes about 40 min on the platform and includes all six Liouville pathways and averages over 4 spatial orientations.

  1. login on (it’s free!)
  2. switch to the gpuheompop tool
  3. click the Launch Tool button (java required)
  4. for this tutorial we use the example input for “FMO coherence, 1 peak spectral density“.
    You can select this preset from the Example selector.
  5. we stick with the provided Exciton System parameters and only change the temperature to 77 K to compare the results with our published data.
  6. in the Spectral Density tab, leave all parameters at the suggested values
  7. to compute 2d spectra, switch to the Calculation mode tab
  8. for compute: choose “two-dimensional spectra“. This brings up input masks for setting the directions of all dipole vectors; we stick with the provided values. However, we select Rotational averaging: “four shot rotational average” and activate all six Liouville pathways by setting ground st[ate] bleach reph[asing], stim[ulated] emission reph[asing], and excited st[ate] abs[orption] to yes, as well as their non-rephasing counterparts (attention! this might require resizing the input mask by pulling at the lower right corner)
  9. That’s all! Hit the Simulate button and your job will be executed on the carter GPU cluster at Purdue University. The simulation takes about 40 minutes of GPU time, which is orders of magnitude faster than any other published method with the same accuracy. You can close and reopen your session in between.
  10. Voilà: your first FMO spectra appear.
  11. Now it’s time to change parameters. What happens at higher temperature?
  12. If you like the results or use them in your work for comparison, we (and the folks at nanoHub who generously develop and provide the nanoHub platform and GPU computation time) appreciate a citation. To make this step easy, a DOI number and reference information are listed at the bottom of the About tab of the tool-page.

With GPU-HEOM we and now you (!) can not only calculate the 2d echo spectra of the Fenna-Matthews-Olson (FMO) complex, but also reveal the strong link between the continuum part of the vibrational spectral density and the prevalence of long-lasting electronic coherences as written in my previous posts

GPU and cloud computing conferences in 2014

Two conferences are currently open for registration related to GPU and cloud computing. I will be attending and presenting at both, please email me if you want to get in touch at the meetings.

Oscillations in two-dimensional spectroscopy

Transition from electronic coherence to a vibrational mode made visible by Short Time Fourier Transform (see text).

Over the last years, a debate has been going on whether the long-lasting oscillatory signals observed in two-dimensional spectra reflect vibrational or electronic coherences, and how the functioning of the molecule is affected. Christoph Kreisbeck and I have performed a detailed theoretical analysis of oscillations in the Fenna-Matthews-Olson (FMO) complex and in a model three-site system. As explained in a previous post, the prerequisites for long-lasting electronic coherences are two features of the continuous part of the vibronic mode density: (i) a small slope towards zero frequency, and (ii) a coupling to the excitonic eigenenergy (ΔE) differences for relaxation. Both requirements are met by the mode density of the FMO complex, and the computationally demanding calculation of two-dimensional spectra of the FMO complex indeed predicts long-lasting cross-peak oscillations with a period matching h/ΔE at room temperature (see our article Long-Lived Electronic Coherence in Dissipative Exciton-Dynamics of Light-Harvesting Complexes or arXiv version). The persistence of oscillations stems from a robust mechanism and does not require adding any vibrational modes at energies ΔE (the general background mode density is enough to support the relaxation toward a thermal state). But what happens if, in addition to the background vibronic mode density, further vibronic modes are placed in the vicinity of the frequencies related to electronic coherences? This fine-tuning model is sometimes discussed in the literature as an alternative mechanism for long-lasting oscillations of vibronic nature. Again, the answer requires actually computing two-dimensional spectra and carefully analyzing the possible chain of laser-molecule interactions.
Due to the special way two-dimensional spectra are measured, the observed signal is a superposition of at least three pathways, which have different sensitivity for distinguishing electronic and vibronic coherences. Being a theoretical physicist now pays off, since we have calculated and analyzed the three pathways separately (see our recent publication Disentangling Electronic and Vibronic Coherences in Two-Dimensional Echo Spectra or arXiv version). One of the pathways leads to an enhancement of vibronic signals, while the combination of the remaining two diminishes electronic coherences otherwise clearly visible within each of them. Our conclusion is that estimates of decoherence times from two-dimensional spectroscopy might actually underestimate the persistence of electronic coherences, which are helping the transport through the FMO network. The fine tuning and addition of specific vibrational modes leaves its marks at certain spots of the two-dimensional spectra, but does not destroy the electronic coherence, which is still there, as a Short Time Fourier Transform of the signal reveals.
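
For readers unfamiliar with the Short Time Fourier Transform, here is a minimal pure-Python sketch of the idea. The signal is synthetic and all numbers are assumptions made for illustration (a decaying oscillation in frequency bin 16 standing in for an electronic coherence, and a weak persistent oscillation in bin 24 standing in for a vibrational mode); it is not data from our spectra. The signal is cut into overlapping Hann-windowed segments and each segment is Fourier transformed, so one can follow how the dominant frequency changes with time.

```python
import cmath
import math


def synthetic_signal(n=512, window_len=128):
    """Decaying 'electronic' oscillation plus persistent 'vibrational' one."""
    return [math.exp(-t / 80.0) * math.cos(2 * math.pi * 16 * t / window_len)
            + 0.2 * math.cos(2 * math.pi * 24 * t / window_len)
            for t in range(n)]


def stft(signal, window_len=128, hop=64):
    """Return one magnitude spectrum per Hann-windowed segment."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * k / (window_len - 1))
              for k in range(window_len)]
    frames = []
    for start in range(0, len(signal) - window_len + 1, hop):
        segment = [signal[start + k] * window[k] for k in range(window_len)]
        spectrum = []
        for f in range(window_len // 2 + 1):
            # plain discrete Fourier sum; fine for short segments
            amplitude = sum(segment[k]
                            * cmath.exp(-2j * math.pi * f * k / window_len)
                            for k in range(window_len))
            spectrum.append(abs(amplitude))
        frames.append(spectrum)
    return frames


def dominant_bin(spectrum):
    """Index of the strongest frequency bin, ignoring the DC component."""
    return max(range(1, len(spectrum)), key=spectrum.__getitem__)


if __name__ == "__main__":
    for index, spectrum in enumerate(stft(synthetic_signal())):
        print("segment", index, "dominant bin", dominant_bin(spectrum))
```

In this toy signal the early segments are dominated by the decaying component and the late ones by the persistent component. The window length sets the usual trade-off between time and frequency resolution.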

Computational physics on GPUs: writing portable code

Runtime in seconds for our GPU-HEOM code on various hardware and software platforms.

I am preparing my presentation for the simGPU meeting next week in Freudenstadt, Germany, and performed some benchmarks.
In the previous post I described how to get an OpenCL program running on a smartphone with GPU. By now Christoph Kreisbeck and I are getting ready to release our first smartphone GPU app for exciton dynamics in photosynthetic complexes, more about that in a future entry.
Getting the same OpenCL kernel running on laptop GPUs, workstation GPUs and CPUs, and smartphones/tablets is a bit tricky, due to different initialisation procedures and the differences in the optimal block sizes for the thread grid. In addition on a smartphone the local memory is even smaller than on a desktop GPU and double-precision floating point support is missing. The situation reminds me a bit of the “earlier days” of GPU programming in 2008.
Besides being a proof of concept, I see writing portable code as a sort of insurance with respect to further changes of hardware (however always with the goal to stick with the massively parallel programming paradigm). I am also amazed how fast smartphones are gaining computational power through GPUs!
Same comparison for smaller memory consumption. Note the drop in OpenCL performance for the NVIDIA K20c GPU.

Here are some considerations and observations:

  1. Standard CUDA code can be ported to OpenCL within a reasonable time-frame. I found the following resources helpful:
    AMD’s porting remarks
    Matt Scarpino’s OpenCL blog
  2. The comparison of OpenCL vs CUDA performance for the same algorithm can reveal some surprises on NVIDIA GPUs. While on our C2050 GPU OpenCL works a bit faster for the same problem compared to the CUDA version, on a K20c system for certain problem sizes the OpenCL program can take several times longer than the CUDA code (no changes in the basic algorithm or workgroup sizes).
  3. The comparison with a CPU version running on 8 cores of the Intel Xeon machine is possible and shows clearly that the GPU code is always faster, but requires a certain minimal system size to show its full performance.
  4. I am looking forward to running the same code on the Intel Xeon Phi systems now available with OpenCL drivers, see also this blog.

[Update June 22, 2013: I updated the graphs to show the 8-core results using Intel’s latest OpenCL SDK. This brings the CPU runtimes down by a factor of 2! Meanwhile I am eagerly awaiting the possibility to run the same code on the Xeon Phis…]

Computational physics on the smartphone GPU

Screenshot of the interacting many-body simulation on the Nexus 4 GPU.

[Update August 2013: Google has removed the OpenCL library with Android 4.3. You can find an interesting discussion here. Google seems to push for its own renderscript protocol. I will not work with renderscript since my priorities are platform independence and sticking with widely adopted standards to avoid fragmentation of my code base.]
I recently got hold of a Nexus 4 smartphone, which features a GPU (Qualcomm Adreno 320) and conveniently ships with an OpenCL library already installed. With minimal changes I got the previously discussed many-body program code related to the fractional quantum Hall effect up and running. No rooting of the phone is required to run the code example. Please use the following recipe at your own risk; I don’t accept any liabilities. Here is what I did:

  1. Download and unpack the Android SDK from Google for cross-compilation (my host computer runs Mac OS X).
  2. Download and unpack the Android NDK from Google to build minimal C/C++ programs without Java (no real app).
  3. Install the standalone toolchain from the Android NDK. I used the following command for my installation:

    /home/tkramer/android-ndk-r8d/build/tools/ \
  4. Put the OpenCL programs and source code in an extra directory, as described in my previous post
  5. Change one line in the cl.hpp header: instead of including <GL/gl.h> change to <GLES/gl.h>. Note: I am using the “old” cl.hpp bindings 1.1, further changes might be required for the newer bindings, see for instance this helpful blog
  6. Transfer the OpenCL library from the phone to a subdirectory lib/ inside your source code. To do so append the path to your SDK tools and use the adb command:

    export PATH=/home/tkramer/adt-bundle-mac-x86_64-20130219/sdk/platform-tools:$PATH
    adb pull /system/lib/
  7. Cross compile your program. I used the following script, please feel free to provide shorter versions. Adjust the include directories and library directories for your installation.

    rm -f plasma_disk_gpu
    /home/tkramer/android-ndk-standalone/bin/arm-linux-androideabi-g++ -v -g \
    -I. \
    -I/home/tkramer/android-ndk-standalone/include/c++/4.6 \
    -I/home/tkramer/android-ndk-r8d/platforms/android-5/arch-arm/usr/include \
    -Llib \
    -march=armv7-a -mfloat-abi=softfp -mfpu=neon \
    -fpic -fsigned-char -fdata-sections -funwind-tables -fstack-protector \
    -ffunction-sections -fdiagnostics-show-option -fPIC \
    -fno-strict-aliasing -fno-omit-frame-pointer -fno-rtti \
    -lOpenCL \
    -o plasma_disk_gpu plasma_disk.cpp
  8. Copy the executable to the data dir of your phone to be able to run it. This can be done without rooting the phone with the nice SSHDroid App, which by default transfers to /data. Don’t forget to copy the kernel .cl files:

    scp -P 2222 root@192.168.0.NNN:
    scp -P 2222 plasma_disk_gpu root@192.168.0.NNN:
  9. ssh into your phone and run the GPU program:
    ssh -p 2222 root@192.168.0.NNN
    ./plasma_disk_gpu 64 16
  10. Check the resulting data files. You can copy them for example to the Download path of the storage and use the gnuplot (droidplot App) to plot them.

A short note about runtimes. On the Nexus 4 device the program runs for about 12 seconds, on a MacBook Pro with NVIDIA GT650M it completes in 2 seconds (in the example above the equations of motion for 16*64=1024 interacting particles are integrated). For larger particle numbers the phone often locks up.

An alternative way to transfer files to the device is to connect via USB cable and to install the Android Terminal Emulator app. Next, create a world-writable directory inside the app’s data directory:

cd /data/data/jackpal.androidterm
mkdir gpu
chmod 777 gpu

On the host computer, use adb to transfer the compiled program and the .cl kernel, and start a shell to run the kernel:

adb push /data/data/jackpal.androidterm/gpu/
adb push plasma_disk_gpu /data/data/jackpal.androidterm/gpu/

You can either run the program within the terminal emulator or use the adb shell

adb shell
cd /data/data/jackpal.androidterm/gpu/
./plasma_disk_gpu 64 16

Let’s see in how many years today’s desktop GPUs can be found in smartphones and which computational physics codes can be run!

Computational physics & GPU programming: exciton lab for light-harvesting complexes (GPU-HEOM) goes live on

User interface of the GPU-HEOM tool for light-harvesting complexes at

Christoph Kreisbeck and I are happy to announce the public availability of the Exciton Dynamics Lab for Light-Harvesting Complexes (GPU-HEOM) hosted on You need to register a user account (it’s free), and then you are ready to use GPU-HEOM for the Frenkel exciton model of light harvesting complexes. In release 1.0 we support

  • calculating population dynamics 
  • tracking coherences between two eigenstates
  • obtaining absorption spectra
  • two-dimensional echo spectra (including excited state absorption)
  • … and all this for general vibronic spectral densities parametrized by shifted Lorentzians.

I will post some more entries here describing how to use the tool for understanding how the spectral density affects the lifetime of electronic coherences (see also this blog entry).
In the supporting document section you will find details of the implemented method and the assumptions underlying the tool. We appreciate your feedback for further improving the tool.
We are grateful for the support of Prof. Gerhard Klimeck, Purdue University, director of the Network for Computational Nanotechnology to bring GPU computing to nanohub (I believe our tool is the first GPU enabled one at nanohub).

If you want to refer to the tool you can cite it as:

Christoph Kreisbeck; Tobias Kramer (2013), “Exciton Dynamics Lab for Light-Harvesting Complexes (GPU-HEOM),” (DOI:10.4231/D3RB6W248).

and you find further references in the supporting documentation.

I very much encourage my colleagues developing computer programs for theoretical physics and chemistry to make them available on platforms such as In my view, it greatly facilitates the comparison of different approaches and is the spirit of advancing science by sharing knowledge and providing reproducible data sets.

Computational physics & GPU programming: interacting many-body simulation with OpenCL

Trajectories in a two-dimensional interacting plasma simulation, reproducing the density and pair-distribution function of a Laughlin state relevant for the quantum Hall effect. Figure taken from Interacting electrons in a magnetic field: mapping quantum mechanics to a classical ersatz-system.

In the second example of my series on GPU programming for scientists, I discuss a short OpenCL program, which you can compile and run on the CPU and the GPUs of various vendors. This gives me the opportunity to perform some cross-platform benchmarks for a classical plasma simulation. You can expect dramatic (several hundredfold) speed-ups on GPUs for this type of system. This is one of the reasons why molecular dynamics codes can gain quite a lot by incorporating the massively parallel programming paradigm in the algorithmic foundations.

The Open Computing Language (OpenCL) is relatively similar to its CUDA counterpart, but in practice the setup of an OpenCL kernel requires some housekeeping work, which might make the code look a bit more involved. I have based my interacting electrons calculation of transport in the Hall effect on an OpenCL code. Another example is An OpenCL implementation for the solution of the time-dependent Schrödinger equation on GPUs and CPUs (arxiv version) by C. Ó Broin and L.A.A. Nikolopoulos.

Now to the coding of a two-dimensional plasma simulation, which is inspired by Laughlin’s mapping of a many-body wave function to an interacting classical ersatz dynamics (for some context see my short review Interacting electrons in a magnetic field: mapping quantum mechanics to a classical ersatz-system on the arxiv).
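
Before turning to the OpenCL version, the structure of the computation can be sketched in plain Python. The sketch assumes the force law of Laughlin’s plasma analogy (pairwise logarithmic repulsion in two dimensions plus a harmonic confinement from the neutralizing background); the function name `plasma_forces` and the parameters `coupling` and `confinement` are illustrative and are not taken from the program discussed in the full post.

```python
def plasma_forces(positions, coupling=1.0, confinement=1.0):
    """Force on every particle of a 2D one-component plasma.

    Potential energy (assumed form):
        U = -coupling * sum_{i<j} ln|r_i - r_j| + confinement/2 * sum_i |r_i|^2
    Each pass of the outer loop is independent of the others, so on a GPU
    it would naturally become one work-item per particle.
    """
    forces = []
    for i, (xi, yi) in enumerate(positions):
        fx = -confinement * xi  # harmonic pull towards the origin
        fy = -confinement * yi
        for j, (xj, yj) in enumerate(positions):
            if i == j:
                continue
            dx, dy = xi - xj, yi - yj
            r2 = dx * dx + dy * dy
            fx += coupling * dx / r2  # gradient of the log repulsion
            fy += coupling * dy / r2
        forces.append((fx, fy))
    return forces


if __name__ == "__main__":
    # Two particles on opposite sides of the origin
    print(plasma_forces([(1.0, 0.0), (-1.0, 0.0)], coupling=2.0))
```

The cost grows with the square of the particle number, which is why this kind of all-pairs interaction benefits so much from massively parallel hardware.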

Continue reading “Computational physics & GPU programming: interacting many-body simulation with OpenCL”

Computational physics & GPU programming: Solving the time-dependent Schrödinger equation

I start my series on the physics of GPU programming with a relatively simple example, which makes use of a mix of library calls and well-documented GPU kernels. The run-time of the split-step algorithm described here is about 280 seconds for the CPU version (Intel(R) Xeon(R) CPU E5420 @ 2.50GHz), vs. 10 seconds for the GPU version (NVIDIA(R) Tesla C1060 GPU), resulting in a 28 fold speed-up! On a C2070 the run time is less than 5 seconds, yielding an 80 fold speedup.

Autocorrelation function C(t) of a Gaussian wavepacket in a uniform force field. I compare the GPU and CPU results using the wavepacket code.

The description of coherent electron transport in quasi two-dimensional electron gases requires solving the Schrödinger equation in the presence of a potential landscape. As discussed in my post Time to find eigenvalues without diagonalization, our approach using wavepackets allows one to obtain the scattering matrix over a wide range of energies from a single wavepacket run without the need to diagonalize a matrix. In the following I discuss the basic example of propagating a wavepacket and obtaining the autocorrelation function, which in turn determines the spectrum. I programmed the GPU code in 2008 as a first test to evaluate the potential of GPGPU programming for my research. At that time double-precision floating point support was lacking and the fast Fourier transform (FFT) implementations were little developed. Starting with CUDA 3.0, the program runs fine in double precision and my group used the algorithm for calculating electron flow through nanodevices. The CPU version was used for our articles in Physica Scripta Wave packet approach to transport in mesoscopic systems and the Physical Review B Phase shifts and phase π-jumps in four-terminal waveguide Aharonov-Bohm interferometers among others.
Here, I consider a very simple example, the propagation of a Gaussian wavepacket in a uniform potential V(x,y)=-Fx, for which the autocorrelation function of the initial state
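$$\langle x,y|\psi(t=0)\rangle = \frac{1}{a\sqrt{\pi}} \exp\left(-\frac{x^2+y^2}{2a^2}\right)$$
is known in analytic form:
$$\langle\psi(t=0)|\psi(t)\rangle = \frac{2a^2 m}{2a^2 m + i\hbar t} \exp\left(-\frac{a^2F^2t^2}{4\hbar^2} - \frac{iF^2t^3}{24\hbar m}\right).$$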
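
As a quick plausibility check, the analytic autocorrelation function can be evaluated and compared with an independent numerical quadrature in momentum space. This is a small Python sketch and not the split-step GPU code of this post; the function names and the default parameters (units with ℏ = m = a = 1 and F = 0.5) are assumptions chosen for illustration. The quadrature uses the fact that in a uniform force field each momentum component is simply shifted by Ft while picking up a known phase, and that the two-dimensional result factorizes into an x part (with force) and a free y part.

```python
import cmath
import math


def autocorrelation_analytic(t, a=1.0, m=1.0, force=0.5, hbar=1.0):
    """Closed-form C(t) of the 2D Gaussian packet in the potential V = -F x."""
    prefactor = 2 * a ** 2 * m / (2 * a ** 2 * m + 1j * hbar * t)
    exponent = (-(a * force * t) ** 2 / (4 * hbar ** 2)
                - 1j * force ** 2 * t ** 3 / (24 * hbar * m))
    return prefactor * cmath.exp(exponent)


def autocorrelation_quadrature(t, a=1.0, m=1.0, force=0.5, hbar=1.0,
                               n=4001, p_max=12.0):
    """Same quantity from a momentum-space integral on a uniform grid."""
    norm = math.sqrt(a / (hbar * math.sqrt(math.pi)))

    def phi(p):
        # momentum-space amplitude of the initial 1D Gaussian
        return norm * math.exp(-(a * p) ** 2 / (2 * hbar ** 2))

    dp = 2 * p_max / (n - 1)
    c_x = 0j  # direction with the force
    c_y = 0j  # free direction
    for k in range(n):
        p = -p_max + k * dp
        q = p - force * t  # momentum at t = 0 that has evolved into p
        phase = (q * q * t + q * force * t * t
                 + force ** 2 * t ** 3 / 3) / (2 * m * hbar)
        c_x += phi(p) * phi(q) * cmath.exp(-1j * phase) * dp
        c_y += phi(p) ** 2 * cmath.exp(-1j * p * p * t / (2 * m * hbar)) * dp
    return c_x * c_y


if __name__ == "__main__":
    for t in (0.0, 1.0, 3.0):
        print(t, autocorrelation_analytic(t), autocorrelation_quadrature(t))
```

A comparison of this kind is also a convenient first test for a grid-based propagation code before moving on to potentials without an analytic solution.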
Continue reading “Computational physics & GPU programming: Solving the time-dependent Schrödinger equation”

The physics of GPU programming

Me pointing at the GPU Resonance cluster at SEAS Harvard with 32x448=14336 processing cores. Just imagine how tightly integrated this setup is compared to 3584 quad-core computers. Picture courtesy of Academic Computing, SEAS Harvard.

From discussions I learn that while many physicists have heard of Graphics Processing Units as fast computers, resistance to using them is widespread. One of the reasons is that physics has been relying on computers for a long time and tons of old, well trusted codes are lying around which are not easily ported to the GPU. Interestingly, the adoption of GPUs happens much faster in biology, medical imaging, and engineering.
I view GPU computing as a great opportunity to investigate new physics and my feeling is that today’s methods optimized for serial processors may need to be replaced by a different set of standard methods which scale better with massively parallel processors. In 2008 I dived into GPU programming for a couple of reasons:

  1. As a “model-builder” the GPU allows me to reconsider previous limitations and simplifications of models and use the GPU power to solve the extended models.
  2. The turn-around time is incredibly fast. Compared to queues in conventional clusters where I wait for days or weeks, I get back results with 10000 CPU hours compute time the very same day. This in turn further facilitates the model-building process.
  3. Some people complain about the strict synchronization requirements when running GPU codes. In my view this is an advantage, since essentially no messaging overhead exists.
  4. If you want to develop a high-performance algorithm, it is not good enough to convert library calls to GPU library calls. You might get speed-ups of about 2-4. However, if you invest the time and develop your own know-how you can expect much higher speed-ups of around 100 times or more, as seen in the applications I discussed in this blog before.

This summer I will lecture about GPU programming at several places and thus I plan to write a series of GPU related posts. I do have a complementary background in mathematical physics and special functions, which I find very useful in connection with GPU programming since new physical models require a stringent mathematical foundation and numerical studies.