Parallel Simulation of Quantum Coherent Evolution by Parameterized Arbitrary Time-dependent Hamiltonians (PSiQCoPATH)

Eric Fellheimer and Mark Rudner

Implementation

PSiQCoPATH is written in C++ and uses the MPI message-passing library to control inter-processor communication. Throughout the design process, we tried to maintain the robustness and adaptability of our code through the techniques of object oriented programming (OOP). To this end, we invested a significant amount of time in the design of proper abstractions in our system. The subject of object oriented programming in the development of PSiQCoPATH will be discussed further in section [*].

Input Files

The input file format for PSiQCoPATH is simple and straightforward. Whitespace is used as the delimiter between entries. Complex numbers such as $4 + 5i$ are supplied as ordered pairs of their real and imaginary parts in parentheses, i.e. $(4,5)$.
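As a side note, the parenthesized $(re, im)$ form happens to match the input format accepted by the C++ standard stream extractor for std::complex, so such entries can be read directly from the file stream. The short sketch below is purely illustrative and is not necessarily how PSiQCoPATH's parser is implemented.

\begin{verbatim}
#include <complex>
#include <iostream>
#include <sstream>

int main() {
    // The standard extractor accepts "r", "(r)", and "(r,i)" -- the same
    // convention used in the PSiQCoPATH input files.
    std::istringstream in("(4,5)");
    std::complex<double> z;
    in >> z;                                                 // z is now 4 + 5i
    std::cout << z.real() << " " << z.imag() << std::endl;   // prints "4 5"
    return 0;
}
\end{verbatim}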

Although there may be more efficient ways to store the input data, taking the most straightforward approach left us more time to concentrate on the algorithms and parallelization of our code. Furthermore, the simple whitespace delimited text file approach should not lead to any problems if PSiQCoPATH is ported across platforms. Figure [*] is an example of an input file for a very simple test run.

The entry on the first line is used to set the value of the isAlpha flag in the code. The value of this flag determines whether the program will run in Hamiltonian evolution mode or quantum circuit mode. In this example, the input ``alpha'' is used to set isAlpha to 1 and hence to instruct PSiQCoPATH to run in Hamiltonian evolution mode. In this case, the rest of the input is used to set up the parameters of the desired time-dependent Hamiltonian.

If ``qga'' is instead specified, the program will run in quantum circuit mode. In this case, the remainder of the input file contains a series of unitary quantum gates to be applied to the system.

The next three lines specify the size of the Hilbert space, the number of terms in the parametric expansion of the Hamiltonian, and $T$, the number of time steps to be calculated. In the example, the Hilbert space has $2$ dimensions, the problem is parametrized by $1$ basis Hamiltonian, and $5$ time steps are to be run.

In the line that follows, the user specifies the values of $t$ at each of the $T$ time steps. Note that this allows the possibility of non-uniform time steps. In the example, however, the time steps are equally sized.

The next three lines are used to specify the values of $\alpha(t)$, $\dot{\alpha}(t)$, and $\ddot{\alpha}(t)$ for the first basis Hamiltonian at each time step. In the example $\alpha(t) = 1$, and $\dot{\alpha} = \ddot{\alpha} = 0$.

Next, the matrix representation of the first ``basis'' Hamiltonian $\hat{H}_1$ is supplied. In the example, the two-dimensional identity matrix is used. In the general case where more than one basis Hamiltonian is required, the coefficients and basis Hamiltonians of the remaining pieces of the total Hamiltonian are supplied next. The format for each piece is the same: a list of values for the coefficients and their derivatives followed by the matrix representation of the basis Hamiltonian.

The final few lines determine the parameters of the output. First, an output filename is specified. Next, the starting point of the output and frequency at which to store the state of the system are given. In the example, the code is instructed to start the output at the first time step, and only save one state vector.

Finally, the value of the outputType flag is set. The value ``evolution'' is used to instruct the code to perform a full time evolution matrix calculation. Alternatively, the value ``state'' is used to instruct the code to perform a single state evolution. In this case, the code will then read in the column vector corresponding to the desired initial state, the vector $\left(\begin{array}{c} 1 \\ 0 \end{array}\right)$ in this example.

Figure: Example of a simple input file. The Hamiltonian is the identity in this case.
\begin{figure}\begin{center}
\begin{verbatim}
alpha
2
1
5
1 2 3 4 5
1 1 1 1 1
0 0 0 0 0
0 0 0 0 0
1 0
0 1
alphaOut.out
1
1
1
state
1
0
\end{verbatim}\end{center}\end{figure}

Figure [*] is a very simple example. For more realistic simulations of larger systems over longer time intervals, it was necessary to write additional scripts to generate the input files automatically. Examples of such scripts are grover.m (written in Matlab) and genAdQCInFile.cpp (written in C++). These scripts are included in the source code archive available on the PSiQCoPATH website.

Output

We also tried to maintain simplicity when creating output files. To aid in data analysis, the output files are created in a format easily read in by Matlab.

In Hamiltonian evolution mode, three output files are created. In the first file, the calculated states or matrices at the desired number of output checkpoints are written. A coefficient file is also created, containing the values of $t$, $\alpha(t)$, $\dot{\alpha}(t)$, and $\ddot{\alpha}(t)$ for each time step. The last file contains the relevant parameters of the calculation and total running time.

Parallel Prefix

Using the decomposition ([*]) from the introduction, a system's time evolution can be represented by the sequence of ``incremental'' time evolution operators
$\displaystyle \left\{\mathbb{1},\ \hat{U}(t_1, t_0),\ \hat{U}(t_2, t_1)\hat{U}(t_1, t_0),\ \hat{U}(t_3, t_2)\hat{U}(t_2, t_1)\hat{U}(t_1, t_0),\ \ldots\right\}$     (1)

The incremental time evolution operator of time step $k$, $\hat{U}(t_{k + 1},  t_k)$, depends only on the local values of $\hat{H}(t_k)$, $\dot{\hat{H}}(t_k)$, etc. As a result, the calculation of each time step's incremental evolution operator can be performed independently of all others. This fact makes the task of generating the set of operators

\begin{displaymath}\left\{\hat{U}(t_1, t_0), \hat{U}(t_2, t_1), \hat{U}(t_3, t_2), \ldots\right\}\end{displaymath}

``embarrassingly parallel.''

Expression ([*]) shows the explicit form of the incremental evolution operators in terms of the Hamiltonian and its derivatives. To any order, this expression involves products of $\hat{H}$ with itself and with its derivatives. On the surface, this seems to indicate that calculating the incremental evolution operators is an expensive operation requiring many matrix multiplications, each scaling as $\mathcal{O}(N^3)$. However, expansion ([*]) makes it possible to push the $\mathcal{O}(T \cdot N^3)$ cost of the matrix multiplications needed to calculate the incremental evolution operators for the entire run into a single group of matrix multiplications at the beginning of the computation. The number of matrix multiplications at this step depends on the order of the calculation and the number of basis Hamiltonians employed. Under most circumstances, only a few basis Hamiltonians are required and this step comes with minimal computational cost compared to the rest of the calculation.

The advantage comes from the observation that, up to second order for example, $\hat{U}(t_k + \Delta t,  t_k)$ is just a linear combination of the static basis Hamiltonians $\{\hat{H}_j\}$ and their binary products $\{\hat{H}_i\hat{H}_{j}\}$. The coefficients of this linear combination are just combinations of the (real) scalars $\alpha_j(t_k), \dot{\alpha}_j(t_k), \ldots$. Thus as the first ``pre-computation'' step of the calculation, PSiQCoPATH calculates the set of matrices

$\displaystyle \hat{B}_{ij} = \hat{H}_i\hat{H}_{j}.$     (2)

For a third order calculation, we would additionally need to calculate the analogous quantities $\hat{T}_{ijk} = \hat{H}_i\hat{H}_{j}\hat{H}_k$, and so forth. Once these multiplications have been performed, the calculation of each incremental evolution operator involves only matrix addition operations.
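A minimal sketch of this pre-computation step is shown below, assuming the basis Hamiltonians are stored as flat row-major arrays of complex numbers; the types and names are illustrative and do not reflect PSiQCoPATH's actual ComplexMatrix interface.

\begin{verbatim}
#include <vector>
#include <complex>

typedef std::vector<std::complex<double> > Mat;  // N x N matrix, row-major

// C = A * B for N x N row-major matrices
static Mat multiply(const Mat& A, const Mat& B, int N) {
    Mat C(N * N, std::complex<double>(0.0, 0.0));
    for (int i = 0; i < N; ++i)
        for (int k = 0; k < N; ++k)
            for (int j = 0; j < N; ++j)
                C[i * N + j] += A[i * N + k] * B[k * N + j];
    return C;
}

// Given the m basis Hamiltonians, return B[i][j] = H_i * H_j, formed once
// at the start of the run.
std::vector<std::vector<Mat> > precomputePairProducts(const std::vector<Mat>& H,
                                                      int N) {
    int m = (int)H.size();
    std::vector<std::vector<Mat> > B(m, std::vector<Mat>(m));
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < m; ++j)
            B[i][j] = multiply(H[i], H[j], N);
    return B;
}
\end{verbatim}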

In our code, the simulation is divided into $p$ contiguous blocks of time of roughly equal size, where $p$ is the number of processors in use. Each processor generates and stores the incremental time evolution operators for all time steps within its allotted block of time.

Once the incremental time step evolution operators have been calculated, the final step is to combine them through matrix multiplication into the sequence ([*]). This is a straightforward application of the parallel prefix algorithm with matrix multiplication playing the role of the associative binary operator. Because matrix multiplication is non-commutative, it is critical to maintain proper operator ordering throughout the calculation.

In general, the number of time steps is much larger than the number of processors available. In this case, the parallel prefix algorithm begins with each processor performing a serial scan operation on its own block of data. Once these local scans are complete, the standard parallel prefix binary tree ascent/descent steps are performed on the top-most (latest time) elements of each processor's data. Finally, each processor other than the root performs a second serial update of its data by right-multiplying each of its own matrices by the top-most element of its earlier time neighbor.
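PSiQCoPATH performs the tree ascent/descent by hand. The sketch below instead uses MPI_Exscan with a user-defined non-commutative reduction operator to express the same cross-block combine step compactly: each processor receives the product of all earlier-time blocks and right-multiplies it onto its local prefix products. The names and storage layout are assumptions for illustration, not the program's actual code.

\begin{verbatim}
#include <mpi.h>
#include <complex>
#include <vector>

typedef std::complex<double> cplx;
static int gN;  // matrix dimension, set before the scan is invoked

// C = A * B for N x N row-major matrices
static void matmul(const cplx* A, const cplx* B, cplx* C, int N) {
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) {
            cplx s(0.0, 0.0);
            for (int k = 0; k < N; ++k) s += A[i * N + k] * B[k * N + j];
            C[i * N + j] = s;
        }
}

// Non-commutative reduction: combine(earlier, later) = later * earlier, so a
// rank-ordered scan yields (block p-1)...(block 1)(block 0), i.e. later
// operators applied on the left, as required by the time ordering.
static void chronoMult(void* invec, void* inoutvec, int* len, MPI_Datatype*) {
    cplx* earlier = static_cast<cplx*>(invec);     // partial product of lower ranks
    cplx* later   = static_cast<cplx*>(inoutvec);  // partial product of higher ranks
    std::vector<cplx> tmp(gN * gN);
    matmul(later, earlier, &tmp[0], gN);
    for (int i = 0; i < gN * gN; ++i) later[i] = tmp[i];
    (void)len;  // len == 1: one whole matrix per element of the derived datatype
}

// localPrefix holds this block's running prefix products from the serial scan;
// its last entry is the block's total product.
void combineBlocks(std::vector<std::vector<cplx> >& localPrefix,
                   int N, MPI_Comm comm) {
    gN = N;
    int rank;
    MPI_Comm_rank(comm, &rank);

    MPI_Datatype matType;
    MPI_Type_contiguous(2 * N * N, MPI_DOUBLE, &matType);  // one complex matrix
    MPI_Type_commit(&matType);

    MPI_Op op;
    MPI_Op_create(&chronoMult, 0 /* non-commutative */, &op);

    // offset = product of the block totals of all earlier-time processors
    std::vector<cplx>& blockTotal = localPrefix.back();
    std::vector<cplx> offset(N * N);
    MPI_Exscan(&blockTotal[0], &offset[0], 1, matType, op, comm);

    // Every processor except the root right-multiplies its local prefix
    // products by the accumulated product of the earlier blocks.
    if (rank != 0) {
        std::vector<cplx> tmp(N * N);
        for (size_t j = 0; j < localPrefix.size(); ++j) {
            matmul(&localPrefix[j][0], &offset[0], &tmp[0], N);
            localPrefix[j] = tmp;
        }
    }

    MPI_Op_free(&op);
    MPI_Type_free(&matType);
}
\end{verbatim}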

When the number of time steps is much larger than the number of processors, the local serial scan steps dominate the running time, giving $\mathcal{O}(2T\cdot N^3/p)$ scaling, where $T$ is the number of time steps, $N$ is the dimension of the system, and $p$ is the number of processors in use. Using an improvement discussed in the section on future work, the prefactor $2$ can be reduced to $1 + f$, where $f$ is the ratio of the number of output steps requested by the user to the total number of time steps. Typically, this ratio is much less than $1$, leading to a nearly 2-fold speedup.

Row Distributed Multiplication

If the user is interested only in the evolution of a particular initial state, computing the full time evolution operator is unnecessarily expensive. That is, a particular initial state $\vert   \psi(t_0)   \rangle $ can be evolved to the state $\vert   \psi(t)   \rangle $ at a time $t > t_0$ at significantly lower computational cost than that of a full time evolution matrix calculation.

Using the method described above, we can calculate the incremental time evolution operators $\hat U(t_k +\Delta t,  t_k)$ to any desired order. The initial state can then be evolved by successively applying the incremental evolution operators to it. In terms of linear algebra, this is simply the problem of calculating: $U_1 x$, $U_2 U_1 x$, $U_3 U_2 U_1 x$, ..., $U_T U_{T - 1} \cdots U_1 x$ where $x$ is an $N\times 1$ column vector and each $U_i$ is an $N \times N$ unitary matrix.

Rather than using serial/parallel scan operations to first compute all of the matrix products and then multiply by the initial state vector to get the final result, a method that scales as $\mathcal{O}(N^3)$, the calculation can be performed using exclusively matrix-vector multiplication operations that scale as $\mathcal{O}(N^2)$. That is, the calculation starts by computing $U_1 x$. This result can then be used to calculate $U_2 U_1 x$ = $U_2 (U_1 x)$, and so on.

Because of its $N$-fold better scaling properties, we used a data-distributed version of this latter technique of successive matrix-vector multiplication to parallelize the calculation of single state evolution. Let $p$ be the number of processing units and $N$ be the dimension of the Hilbert space. Let $T$ be the number of matrices in the operation, i.e. the length of the simulation. The algorithm is quite simple (a code sketch follows the list):

  1. Each processor stores approximately $N/p$ rows of each of the $T$ matrices $U_i$. Let $U_i^k$ be the ``local'' matrix stored by processor $k$. That is, $U_i^k$ contains $N/p$ rows of $U_i$.
  2. Let the vector $curRes = x$, where $x$ is the initial state of the system.
  3. for $i = 1$ to $T$:
    1. Each processor $k$ calculates $localRes = U_i^k \times curRes$
    2. Each processor broadcasts $localRes$ to all others. The next state vector is reconstructed from each processor's $localRes$ vector and stored in $curRes$
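The following is a minimal sketch of this loop, assuming for simplicity that $N$ is divisible by $p$ and that each processor already holds its $N/p$ rows of every $U_i$ in row-major order; the names are illustrative rather than PSiQCoPATH's actual interface, and the reconstruction step uses MPI_Allgather with each complex entry transmitted as two doubles.

\begin{verbatim}
#include <mpi.h>
#include <vector>
#include <complex>

typedef std::complex<double> cplx;

// localU[i] holds this processor's N/p rows of U_i, stored row-major.
// curRes holds the full N-component state vector on every processor.
void evolveState(const std::vector<std::vector<cplx> >& localU,
                 std::vector<cplx>& curRes,
                 int N, int T, MPI_Comm comm)
{
    int p, rank;
    MPI_Comm_size(comm, &p);
    MPI_Comm_rank(comm, &rank);
    int localRows = N / p;               // rows owned by this processor

    std::vector<cplx> localRes(localRows);
    std::vector<cplx> nextRes(N);

    for (int i = 0; i < T; ++i) {
        // localRes = U_i^k * curRes (only the locally stored rows)
        for (int r = 0; r < localRows; ++r) {
            cplx sum(0.0, 0.0);
            for (int c = 0; c < N; ++c)
                sum += localU[i][r * N + c] * curRes[c];
            localRes[r] = sum;
        }
        // Reassemble the full state vector on every processor.
        MPI_Allgather(&localRes[0], 2 * localRows, MPI_DOUBLE,
                      &nextRes[0],  2 * localRows, MPI_DOUBLE, comm);
        curRes = nextRes;
    }
}
\end{verbatim}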

Runtime analysis of Row Distributed Matrix Vector Multiplications

The running time associated with this algorithm will be:

\begin{displaymath}
\mathcal{O} \left( \frac{N^2}{p} + C(N,p) \right) T
\end{displaymath} (3)

where $C(N,p)$ is the time associated with sharing the local result $localRes$ with all other processors. No MPI documentation we found gave a good estimate of the running time of MPI_Allgather. Nonetheless, a safe upper bound is $C(N,p) = \mathcal{O}(p (p-1) N/p) = \mathcal{O}(Np)$. That is, in the worst case, each processor sends $\mathcal{O}(N/p)$ data elements to each of the $(p - 1)$ other processes.

Thus, if $N$ is large, then $C(N,p)$ is dominated by the matrix vector multiplications. In this case, the running time is close to:

\begin{displaymath}
\mathcal{O} \left ( \frac{N^2T}{p} \right)
\end{displaymath} (4)

This result is highly desirable because it represents a $p$-fold speedup over the ``ideal'' serial algorithm based on matrix-vector multiplication.

Results

PSiQCoPATH's first sanity check came in the form of the simplest possible quantum system: a spin-1/2 magnetic moment in a static applied magnetic field. In this situation, the well known exact solution is that the spin precesses about the field with angular frequency $\omega_L$, the Larmor frequency. Figure [*] shows the Bloch sphere representation of the trajectory of a spin initially aligned in the $z$-direction in the presence of a magnetic field parallel to the $y$-direction. From this plot it is clear that the spin exhibits precession as we know it should.

Figure: Bloch sphere representation of spin trajectory with $\hat{H} =\hat{\sigma}^y$
\includegraphics[height=3in]{sigmaYSpin.jpg}

This case is perhaps too simple to serve as a test, though, since its Hamiltonian is independent of time. The second test was to have PSiQCoPATH simulate the behavior of a spin-1/2 magnetic moment in a time-varying magnetic field. The magnetic field varied in time according to $\vec{B}(t) = \sin(\Omega_B t) \vec{j} + \cos(\Omega_B t) \vec{k}$, where $\vec{j}$ and $\vec{k}$ are unit vectors in the $y$ and $z$ directions, respectively. The Hamiltonian for this system is

$\displaystyle \hat{H}(t) = \sin(\Omega_B t)  \hat{\sigma}^y + \cos(\Omega_B t) \hat{\sigma}^z$     (5)

where $\hat{\sigma}^y$ and $\hat{\sigma}^z$ are the Pauli spin operators. In the standard basis, the Pauli operators are represented by the matrices
$\displaystyle \sigma^x = \left(\begin{array}{cc} 0 & 1 \\ 1 & 0\end{array}\right), \quad \sigma^y = \left(\begin{array}{cc} 0 & -i \\ i & 0\end{array}\right), \quad \sigma^z = \left(\begin{array}{cc} 1 & 0 \\ 0 & -1\end{array}\right)$     (6)

As the initial condition, the spin was aligned with the field along the z-axis. Some of the results of these calculations are shown in figure [*].

Figure: Bloch sphere representation of spin trajectories with $\hat{H} = \sin(\Omega_B t)\hat{\sigma}^y + \cos(\Omega_B t)\hat{\sigma}^z$
\includegraphics[height=3in]{AdiabaticSpins.jpg}

In the case where $\Omega_B/\omega_L = 0.01$, the motion is very nearly adiabatic. That is, the spin direction very closely follows the field direction. As $\Omega_B$ increases, the trajectory acquires increasingly large cycloid-like wiggles. Although this behavior was not expected beforehand, in retrospect it is easy to understand.

Figure: Equivalent problem - the trajectory of a point on the rim of a cone rolling on a flat surface
\includegraphics[height=2in]{cone.jpg}

The key to this understanding is a mapping that we discovered between this problem and the trajectory of a point on the rim of a cone rolling on a flat surface (see figure [*]). This mapping comes from the fact that a magnetic field generates rotations about its direction with frequency $\omega_L$. In our case, this rotation axis is itself rotating with frequency $\Omega_B$ in the $yz$-plane.

The rolling motion of a cone on a flat surface consists of two combined rotations: the cone rotates about its own symmetry axis with frequency $\omega_a$ and about the vertical axis through its tip with frequency $\Omega$. Under the condition of rolling without slippage, the combined effect of these two angular velocities is a net angular velocity along the line of contact between the cone and the surface. As the cone rolls, the direction of this instantaneous axis of rotation rotates about the vertical direction with frequency $\Omega$.

Thus the instantaneous axis of rotation in the cone problem has exactly the same behavior as the instantaneous axis of rotation (the magnetic field) in our spin problem. At a given instant, all points on the cone are rotating about the instantaneous axis of rotation, just as at any given instant the spin is precessing about the instantaneous direction of the magnetic field.

The half-angle of the cone corresponding to a particular choice of $\omega_L$ and $\Omega_B$ in the quantum spin problem can be found by simple trigonometry, and is given by the relation

$\displaystyle \tan\alpha = \frac{\Omega_B}{\omega_L}$     (7)

Note that $\alpha \rightarrow 0$ as $\Omega_B/\omega_L \rightarrow 0$. This means that for very slowly changing fields, the maximum amplitude of the spin's deviations from the field direction goes to zero as expected in the adiabatic limit. From this analysis, we see that any initial condition that starts the quantum spin at a point on the rim of the cone associated with the particular values of $\Omega_B$ and $\omega_L$ of that system will lead to a spin trajectory equivalent to that of the corresponding point on the rim of the associated rolling cone.

Aside from the intellectual interest of this result, it also turns out to be one of the rare cases of a Hamiltonian with non-trivial time dependence for which we have an exact answer to compare with the simulation. The agreement with our results appears to be quite good, though we have not explored it in quantitative detail.

Once this method validation was complete, we used PSiQCoPATH to simulate the solution of a few instances of NP-complete problems using the method of quantum computation by adiabatic evolution described in the introduction and the reference given there. The particular problem for which we had easy access to the proper Hamiltonians was the so-called exact cover problem. Exact cover is a version of satisfiability involving 3-bit clauses of the form

$\displaystyle z_i + z_j + z_k = 1$     (8)

where $z_i$, $z_j$, and $z_k$ are bits that take on the values 0 or 1.

This problem is described in detail in the paper by Farhi et al. As in that paper, we use a linear interpolation between the initial and final Hamiltonians of the form

$\displaystyle \hat{H}(t) = \left(\frac{t}{T}\right) \hat{H}_p + \left(1 - \frac{t}{T}\right)\hat{H}_0$     (9)

where $\hat{H}_p$ is the ``problem'' Hamiltonian and $\hat{H}_0$ is an initial Hamiltonian with ground state
$\displaystyle \vert   \psi_0   \rangle = \frac{1}{\sqrt{2^N}}\left(\vert   0   \rangle + \vert   1   \rangle \right)^{\otimes N},$     (10)

and $T$ is the length of the run. Larger values of $T$ correspond to longer runs, and hence slower evolution. Thus the evolution should become increasingly adiabatic as $T$ becomes large.
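For concreteness, this interpolation corresponds to input-file coefficients $\alpha_0(t) = 1 - t/T$, $\dot{\alpha}_0 = -1/T$, $\ddot{\alpha}_0 = 0$ for $\hat{H}_0$ and $\alpha_p(t) = t/T$, $\dot{\alpha}_p = 1/T$, $\ddot{\alpha}_p = 0$ for $\hat{H}_p$. A tiny illustrative generator in the spirit of genAdQCInFile.cpp (not the actual script) might tabulate these coefficients as follows:

\begin{verbatim}
#include <cstdio>

int main() {
    const int steps = 5;    // number of time steps (tiny, for illustration)
    const double T = 5.0;   // run length; larger T means slower evolution
    for (int k = 1; k <= steps; ++k) {
        double t = T * k / steps;
        // columns: t, alpha_p, alpha_p', alpha_p'', alpha_0, alpha_0', alpha_0''
        std::printf("%g %g %g %g %g %g %g\n",
                    t, t / T, 1.0 / T, 0.0, 1.0 - t / T, -1.0 / T, 0.0);
    }
    return 0;
}
\end{verbatim}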

We would like to thank Daniel Nagaj for supplying us with these Hamiltonians. An example of our results for a 6 qubit instance of Exact Cover is shown in figure [*].

Figure: Eigenstate populations vs time for 6 qubit Exact Cover for sweep rates $T = 50$ and $T = 500$
\includegraphics[height=2.5in]{EC6.jpg}

These plots were generated by a Matlab script written by us to parse the PSiQCoPATH output files and perform the desired analysis. The script diagonalizes the system's Hamiltonian at each output time step and transforms the evolved state at the corresponding time step into this eigenbasis. The $n^{th}$ eigenstate population is equal to the square magnitude of the $n^{th}$ component of the evolved state in the instantaneous eigenbasis. For $T = 50$, we see that the probability of finishing in the ground state, i.e. of obtaining the correct solution to the problem, is approximately 60%. When $T = 500$, this probability is very nearly 1.

In figure [*], the energy levels (eigenvalues) of the instantaneous Hamiltonian are plotted over the course of the evolution. Notice that the position of the minimum energy gap is precisely where the ground state population gets ``lost'' in the fast run. This is what is expected from the considerations of the adiabatic theorem, and makes for a nice confirmation of the theory.

Figure: Energy levels vs time for 6 qubit Exact Cover
\includegraphics[height=4in]{EC6_50-Energies.jpg}

In the end we were only able to test up to 8 qubits. This is not enough to make progress over the current state of the art in research on this topic, but with the improvements described in the section on future work we should be able to scale up to much higher dimension. All results were obtained from full time evolution operator calculations. Although this calculation is in a sense overkill for the analysis we performed, the full time evolution matrix could be used to find the success probability in the case where the initial state is actually a ``mixed state'' due to thermal noise and/or uncertain preparation. This is an interesting situation to look at from a practical point of view, as it is a more realistic picture of the situation in real physical implementations.

With the distributed data approach of the matrix-vector multiplication single state evolution algorithm, we should be able to reach even larger systems. Memory usage is significantly lower in that case, and the distribution across processors should allow us to handle much larger matrices without having to reach beyond the cache/fast memory. This code did not become operational until after the tests described, so we only have detailed results for the full time evolution operator calculations.

Throughout these runs we also kept track of PSiQCoPATH's performance in terms of running time. Running time as a function of simulation length and number of processors is plotted in figure [*]. The trends are very nearly linear in both $1/p$ and $T$, confirming our projected scaling rules.

Figure: Running time vs inverse number of processors
\includegraphics[height=4in]{PSiQCoPATH-Timing.jpg}

Future Work

One area where PSiQCoPATH could be improved significantly is memory management during parallel prefix runs. In its current form, the program first calculates and stores the incremental time evolution operators for all time steps. Each processor then performs a serial scan operation on its own block of data. The storage of all the incremental time evolution operators accounts for the vast majority of PSiQCoPATH's memory usage. This memory usage, in turn, is what limits the size of problem we are able to run.

We recently realized that it is not necessary to ever have all of the incremental time evolution operators stored at one time. In general, the number of output timesteps requested by the user is much less than the number of actual timesteps performed in the evolution (by a factor of perhaps 1000). Rather than storing all of the incremental operators, all we really need is that fraction of them corresponding to the much coarser output time step.

A much more efficient procedure would be to partially combine the first two steps in the following way. Let $\mathcal{N}$ be the ratio of the total number of time steps to the number of output steps requested. Instead of storing every $\hat{U}(t_k + \Delta t,  t_k)$, we really only need to store each $\hat{U}_{\mathcal{N}}(t_k + \mathcal{N}\Delta t,  t_k)$ = $\hat{U}(t_k + \mathcal{N}\Delta t,  t_k + (\mathcal{N} - 1)\Delta t)\cdots\hat{U}(t_k + \Delta t,  t_k)$. We can build up $\hat{U}_{\mathcal{N}}$ by multiplying in each successive incremental time evolution operator as it is generated. Once $\mathcal{N}$ time steps have been calculated and combined, that $\hat{U}_{\mathcal{N}}$ can be stored in memory and the next one started. In this way, the memory requirements of the program will be reduced roughly by a factor of $\mathcal{N}$.
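A minimal sketch of the proposed accumulation loop is given below. Here Mat is a flat row-major $N \times N$ complex matrix, and computeIncremental() is a placeholder for the real routine that builds $\hat{U}(t_{k+1}, t_k)$ from the basis Hamiltonians; both are purely illustrative and assume the total number of steps is a multiple of the output stride.

\begin{verbatim}
#include <vector>
#include <complex>

typedef std::vector<std::complex<double> > Mat;

Mat identityMatrix(int N) {
    Mat I(N * N, std::complex<double>(0.0, 0.0));
    for (int i = 0; i < N; ++i) I[i * N + i] = 1.0;
    return I;
}

Mat multiply(const Mat& A, const Mat& B, int N) {
    Mat C(N * N, std::complex<double>(0.0, 0.0));
    for (int i = 0; i < N; ++i)
        for (int k = 0; k < N; ++k)
            for (int j = 0; j < N; ++j)
                C[i * N + j] += A[i * N + k] * B[k * N + j];
    return C;
}

// Placeholder: in the real code this would evaluate expansion ([*]) at step k.
Mat computeIncremental(int k, int N) { (void)k; return identityMatrix(N); }

// Keep only one cumulative operator per block of 'stride' time steps.
std::vector<Mat> accumulateCoarse(int totalSteps, int stride, int N) {
    std::vector<Mat> stored;
    Mat running = identityMatrix(N);
    for (int k = 0; k < totalSteps; ++k) {
        Mat incr = computeIncremental(k, N);   // built and discarded on the fly
        running = multiply(incr, running, N);  // later operator applied on the left
        if ((k + 1) % stride == 0) {
            stored.push_back(running);         // this block's coarse operator
            running = identityMatrix(N);       // start accumulating the next block
        }
    }
    return stored;
}
\end{verbatim}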

As an additional benefit, the final local update step will also be shortened by a factor of $\mathcal{N}$. Overall this leads to a speedup of approximately $2/(1 + \mathcal{N}^{-1})$ in the projected running time.

Also, we did not fully explore alternative basic arithmetic algorithms that could speed up the system. For instance, Strassen's algorithm for matrix multiplication runs asymptotically faster than $\mathcal{O}(N^3)$. However, this algorithm has significantly different memory usage, and its overhead is much larger than that of standard matrix multiplication. Thus, simply using Strassen's algorithm would not necessarily be an improvement. Possible changes of this sort are worth considering, and can easily be integrated and tested in our code due to its object-based construction.


Case Study on Object Oriented Parallel Programming

Throughout the course on parallel computing, most of the skeletal code we were given was not object-oriented. We made it a point to make our code as object-oriented as possible. Over the course of the project we found that many aspects of object oriented programming carry over directly to the parallel setting, but we also encountered some new challenges unique to parallel programming.

At a high level, the big advantage of object-oriented programming is the power of abstraction. We employed such abstraction with objects that we knew would be parallelized. For instance, in the ComplexMatrix class, we have methods such as send, receive, and rowDist to communicate matrix objects between processors.

The send and receive methods were relatively simple and worked well. These routines send entire matrices to/from other processors via MPI. An alternative approach would have been to define a new MPI datatype for objects of the ComplexMatrix class, but we found it simpler to add these communication methods to the class itself.
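A minimal sketch of what such methods might look like is shown below; the class here illustrates the approach rather than PSiQCoPATH's actual ComplexMatrix implementation, and each complex entry is simply transmitted as two contiguous doubles.

\begin{verbatim}
#include <mpi.h>
#include <complex>
#include <vector>

class ComplexMatrix {
public:
    explicit ComplexMatrix(int n) : dim(n), data(n * n) {}

    // Send the full matrix contents to another processor.
    void send(int dest, int tag, MPI_Comm comm) const {
        MPI_Send(const_cast<std::complex<double>*>(&data[0]),
                 2 * dim * dim, MPI_DOUBLE, dest, tag, comm);
    }

    // Receive a matrix of the same dimension from another processor.
    void receive(int src, int tag, MPI_Comm comm) {
        MPI_Status status;
        MPI_Recv(&data[0], 2 * dim * dim, MPI_DOUBLE, src, tag, comm, &status);
    }

private:
    int dim;
    std::vector<std::complex<double> > data;  // row-major storage
};
\end{verbatim}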

The rowDist method was somewhat awkward. Its goal was to distribute the data of a matrix stored completely on a single processor to all other processors in row-wise fashion. Similar to an MPI_Bcast command, every processor calls rowDist. Before calling this command, however, each processor had to determine how many rows it would store. This added an additional step to the method, but was not an insurmountable challenge.

Overall, we found object oriented techniques to be very useful in maintaining abstractions in an MPI-driven parallel program. However, classes should be designed with parallelism in mind to achieve maximum robustness. It is not always easy to sprinkle MPI routines into a class after it has already been designed.

Conclusion

Due to the growing importance of quantum systems to the information technology industry as well as their intrinsic scientific interest, it is important to be able to simulate quantum dynamics as accurately and efficiently as possible. We have developed PSiQCoPATH in an attempt to apply the power of parallel computing to the accurate simulation of a very wide class of quantum problems. Validation was performed on several test systems and showed excellent agreement with analytical predictions.

While there are certainly issues we could have considered in greater detail, this project has been a huge success as a proof of concept. Quantum systems can indeed be simulated on parallel machines efficiently as a means for learning more about the quantum mechanics of various physical systems.



2005-05-12