Parallel Simulation of Quantum Coherent Evolution by Parameterized Arbitrary Time-dependent Hamiltonians

Implementation

PSiQCoPATH is written in C++ and uses the MPI library to handle inter-processor communication. Throughout the design process, we tried to maintain the robustness and adaptability of our code through the techniques of object-oriented programming (OOP). To this end, we invested a significant amount of time in the design of proper abstractions in our system. The role of object-oriented programming in the development of PSiQCoPATH is discussed further in the case study section below.

Input Files

The input file format for PSiQCoPATH is simple and straightforward. Whitespace is used as the delimiter between entries. Complex numbers such as $4 + 5i$ are supplied as ordered pairs of their real and imaginary parts in parentheses, i.e. $(4,5)$.

Although there may be more efficient ways to store the input data, taking the most straightforward approach left us more time to concentrate on the algorithms and parallelization of our code. Furthermore, the simple whitespace delimited text file approach should not lead to any problems if PSiQCoPATH is ported across platforms. Figure 1 is an example of an input file for a very simple test run.

Figure 1: Example of a simple input file. The Hamiltonian is the identity in this case.
\begin{figure}\begin{center}
\begin{verbatim}
alpha
2
1
5
1 2 3 4 5
1 1 1 1 1
0 0 0 0 0
0 0 0 0 0
1 0
0 1
alphaOut.out
1
1
1
state
1
0
\end{verbatim}
\end{center}\end{figure}

The entry of the first line is used to set the value of the isAlpha flag in the code. The value of this flag determines whether the program will run in Hamiltonian evolution mode or quantum circuit mode. In this example, the input "alpha" is used to set isAlpha to 1 and hence to instruct PSiQCoPATH to run in Hamiltonian evolution mode. In this case, the rest of the input is used to set up the parameters of the desired time-dependent Hamiltonian.

If "qga" is instead specified, the program will run in quantum circuit mode. In this case, the remainder of the input file contains a series of unitary quantum gates to be applied to the system.

The next three lines specify the size of the Hilbert space, the number of terms in the parametric expansion of the Hamiltonian, and $T$, the number of time steps to be calculated. In the example, the Hilbert space has $2$ dimensions, the expansion contains $1$ basis Hamiltonian, and $5$ time steps are to be run.

In the line that follows, the user specifies the values of $t$ at each of the $T$ time steps. Note that this allows the possibility of non-uniform time steps. In the example, however, the time steps are equally sized.

The next three lines are used to specify the values of $\alpha(t)$, $\dot{\alpha}(t)$, and $\ddot{\alpha}(t)$ for the first basis Hamiltonian at each time step. In the example $\alpha(t) = 1$, and $\dot{\alpha} = \ddot{\alpha} = 0$.

Next, the matrix representation of the first ``basis'' Hamiltonian $\hat{H}_1$ is supplied. In the example, the two-dimensional identity matrix is used. In the general case where more than one basis Hamiltonian is required, the coefficients and basis Hamiltonians of the remaining pieces of the total Hamiltonian are supplied next. The format for each piece is the same: a list of values for the coefficients and their derivatives, followed by the matrix representation of the basis Hamiltonian.

The final few lines determine the parameters of the output. First, an output filename is specified. Next, the starting point of the output and frequency at which to store the state of the system are given. In the example, the code is instructed to start the output at the first time step, and only save one state vector.

Finally, the value of the outputType flag is set. The value ``evolution'' is used to instruct the code to perform a full time evolution matrix calculation. Alternatively, the value ``state'' is used to instruct the code to perform a single state evolution. In this case, the code will then read in the column vector corresponding to the desired initial state, the vector $\left(\begin{array}{c} 1 \\ 0 \end{array}\right)$ in this example.

Figure 1 is a very simple example. For more realistic simulations of larger systems over longer time intervals, it was necessary to write additional scripts to generate the input files automatically. Examples of such scripts are grover.m (written in Matlab) and genAdQCInFile.cpp (written in C++). These scripts are included in the source code archive available on the PSiQCoPATH website.
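
As a rough illustration, a minimal generator for the whitespace-delimited format described above might look like the following. This is a hypothetical sketch that simply reproduces Figure 1; it is not the actual genAdQCInFile.cpp.

\begin{verbatim}
// Hypothetical generator for the whitespace-delimited input format
// described above; the real generators (grover.m, genAdQCInFile.cpp)
// are in the PSiQCoPATH source archive.
#include <fstream>

int main() {
    std::ofstream in("example.in");
    const int dim = 2;     // Hilbert space dimension
    const int nBasis = 1;  // terms in the parametric expansion
    const int T = 5;       // number of time steps

    in << "alpha\n" << dim << "\n" << nBasis << "\n" << T << "\n";

    // Time grid (uniform here, though non-uniform steps are allowed).
    for (int k = 1; k <= T; ++k) in << k << " ";
    in << "\n";

    // alpha(t), followed by its first and second derivatives, per step.
    for (int k = 0; k < T; ++k) in << "1 ";
    in << "\n";
    for (int k = 0; k < T; ++k) in << "0 ";
    in << "\n";
    for (int k = 0; k < T; ++k) in << "0 ";
    in << "\n";

    // Matrix of the single basis Hamiltonian (the 2x2 identity).
    in << "1 0\n0 1\n";

    // Output parameters, output type, and initial state, as in Figure 1.
    in << "alphaOut.out\n1\n1\n1\nstate\n1\n0\n";
    return 0;
}
\end{verbatim}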

Output

We also tried to maintain simplicity when creating output files. To aid in data analysis, the output files are created in a format easily read in by Matlab.

In Hamiltonian evolution mode, three output files are created. In the first file, the calculated states or matrices at the desired number of output checkpoints are written. A coefficient file is also created, containing the values of $t$, $\alpha(t)$, $\dot{\alpha}(t)$, and $\ddot{\alpha}(t)$ for each time step. The last file contains the relevant parameters of the calculation and total running time.

Parallel Prefix in the Calculation of the Time Evolution Operator

Using decomposition (5) from the section on the theory of PSiQCoPATH, a system's time evolution can be represented by the sequence of "incremental" time evolution operators
\begin{displaymath}
\left\{\mathbb{1},\; \hat{U}(t_1,\, t_0),\; \hat{U}(t_2,\, t_1)\hat{U}(t_1,\, t_0),\; \hat{U}(t_3,\, t_2)\hat{U}(t_2,\, t_1)\hat{U}(t_1,\, t_0),\; \ldots\right\}
\end{displaymath} (1)
The incremental time evolution operator of time step $k$, $\hat{U}(t_{k + 1},  t_k)$, depends only on the local values of $\hat{H}(t_k)$, $\dot{\hat{H}}(t_k)$, etc. As a result, the calculation of each time step's incremental evolution operator can be performed independently of all others. This fact makes the task of generating the set of operators

\begin{displaymath}\left\{\hat{U}(t_1, t_0), \hat{U}(t_2, t_1), \hat{U}(t_3, t_2), \ldots\right\}\end{displaymath}
"embarassingly parallel."

Expression (8) of the theory section gives the explicit form of the incremental evolution operators in terms of the Hamiltonian and its derivatives. At any order, this expression involves products of $\hat{H}$ with itself and with its derivatives. On the surface, this seems to indicate that calculating the incremental evolution operators is an expensive operation requiring many matrix multiplications, each scaling as $\mathcal{O}(N^3)$. However, expansion (9) makes it possible to push the $\mathcal{O}(T \cdot N^3)$ cost of the matrix multiplications needed to calculate the incremental evolution operators for the entire run into a single group of matrix multiplications at the beginning of the computation. The number of matrix multiplications in this step depends on the order of the calculation and the number of basis Hamiltonians employed. Under most circumstances, only a few basis Hamiltonians are required, so this step carries minimal computational cost compared to the rest of the calculation.

The advantage comes from the observation that, up to second order for example, $\hat{U}(t_k + \Delta t,  t_k)$ is just a linear combination of the static basis Hamiltonians $\{\hat{H}_j\}$ and their binary products $\{\hat{H}_i\hat{H}_{j}\}$. The coefficients of this linear combination are just combinations of the (real) scalars $\alpha_j(t_k), \dot{\alpha}_j(t_k), \ldots$. Thus as the first ``pre-computation'' step of the calculation, PSiQCoPATH calculates the set of matrices

\begin{displaymath}
\hat{B}_{ij} = \hat{H}_i\hat{H}_{j}.
\end{displaymath} (2)
For a third order calculation, we would additionally need to calculate the analogous quantities $\hat{T}_{ijk} = \hat{H}_i\hat{H}_{j}\hat{H}_k$, and so forth. Once these multiplications have been performed, the calculation of each incremental evolution operator involves only matrix addition operations.
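
To make the pre-computation step concrete, the sketch below forms every pairwise product $\hat{B}_{ij}$ once at the start of a run. The type and function names here are illustrative placeholders, not the actual PSiQCoPATH ComplexMatrix interface.

\begin{verbatim}
// Sketch of the second-order pre-computation step: form every pairwise
// product B_ij = H_i * H_j of the static basis Hamiltonians once, so that
// each incremental evolution operator later requires only additions and
// scalar multiplications.  Types and names are illustrative only.
#include <vector>
#include <complex>

typedef std::vector<std::complex<double> > Mat;   // flattened N x N, row-major

static Mat matmul(const Mat& A, const Mat& B, int N) {
    Mat C(N * N, 0.0);
    for (int r = 0; r < N; ++r)
        for (int k = 0; k < N; ++k)
            for (int c = 0; c < N; ++c)
                C[r * N + c] += A[r * N + k] * B[k * N + c];   // O(N^3) once
    return C;
}

// Given the m basis Hamiltonians, build the m*m products B[i][j] = H_i H_j.
std::vector<std::vector<Mat> > precomputeProducts(const std::vector<Mat>& H,
                                                  int N) {
    size_t m = H.size();
    std::vector<std::vector<Mat> > B(m, std::vector<Mat>(m));
    for (size_t i = 0; i < m; ++i)
        for (size_t j = 0; j < m; ++j)
            B[i][j] = matmul(H[i], H[j], N);   // done once per run
    return B;
}
\end{verbatim}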

In our code, the simulation is divided into $p$ contiguous blocks of time of roughly equal size, where $p$ is the number of processors in use. Each processor generates and stores the incremental time evolution operators for all time steps within its allotted block of time.

Once the incremental time step evolution operators have been calculated, the final step is to combine them through matrix multiplication into the sequence (1). This is a straightforward application of the parallel prefix algorithm with matrix multiplication playing the role of the associative binary operator. Because matrix multiplication is non-commutative, it is critical to maintain proper operator ordering throughout the calculation.

In general, the number of time steps is much larger than the number of processors available. In this case, the parallel prefix algorithm begins with each processor performing a serial scan operation on its own block of data. Once these local scans are complete, the standard parallel prefix binary tree ascent/descent steps are performed on the top-most (latest time) elements of each processor's data. Finally, each processor other than the root performs a second serial update of its data by right-multiplying each of its own matrices by the top-most element of its earlier time neighbor.
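
The sketch below illustrates these three phases. For brevity it replaces the binary-tree combination of block totals with an MPI_Allgather of each block's total product, so it is a simplified variant of the scheme described above; all names (prefixEvolution, matmul, Mat) are hypothetical, and complex entries are shipped as pairs of doubles.

\begin{verbatim}
// Sketch of the parallel prefix (scan) phase with matrix multiplication as
// the associative operator.  Each processor holds the incremental operators
// of one contiguous block of time, earliest time first.
#include <mpi.h>
#include <vector>
#include <complex>

typedef std::vector<std::complex<double> > Mat;   // flattened N x N, row-major

// C = A * B, keeping the later-time operator A on the left.
static Mat matmul(const Mat& A, const Mat& B, int N) {
    Mat C(N * N, 0.0);
    for (int r = 0; r < N; ++r)
        for (int k = 0; k < N; ++k)
            for (int c = 0; c < N; ++c)
                C[r * N + c] += A[r * N + k] * B[k * N + c];
    return C;
}

// On return, each U[i] holds the product of all incremental operators up to
// and including its own time step, over the global time ordering.
void prefixEvolution(std::vector<Mat>& U, int N, MPI_Comm comm) {
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    // 1. Local serial scan: accumulate within this processor's block.
    for (size_t i = 1; i < U.size(); ++i)
        U[i] = matmul(U[i], U[i - 1], N);

    // 2. Share every block's total product (its latest-time element),
    //    shipping each complex entry as two doubles.
    std::vector<std::complex<double> > totals(p * N * N);
    MPI_Allgather(&U.back()[0], 2 * N * N, MPI_DOUBLE,
                  &totals[0],   2 * N * N, MPI_DOUBLE, comm);

    // 3. Combine the totals of all earlier-time blocks, later on the left.
    Mat offset(N * N, 0.0);
    for (int r = 0; r < N; ++r) offset[r * N + r] = 1.0;   // identity
    for (int k = rank - 1; k >= 0; --k)
        offset = matmul(offset, Mat(totals.begin() + k * N * N,
                                    totals.begin() + (k + 1) * N * N), N);

    // 4. Local update: right-multiply by the earlier-time prefix.
    if (rank > 0)
        for (size_t i = 0; i < U.size(); ++i)
            U[i] = matmul(U[i], offset, N);
}
\end{verbatim}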

When the number of time steps is much larger than the number of processors, the local serial scan steps dominate the running time, giving $\mathcal{O}(2T\cdot N^3/p)$ scaling, where $T$ is the number of time steps, $N$ is the dimension of the system, and $p$ is the number of processors in use. Using an improvement discussed in the section on future work, the prefactor $2$ can be reduced to $1 + f$, where $f$ is the ratio of the number of output steps requested by the user to the total number of time steps. Typically, this ratio is much less than $1$, leading to a nearly 2-fold speedup.

Row Distributed Multiplication

If the user is interested only in the evolution of a particular initial state, calculating the full time evolution operator is unnecessarily expensive. That is, a particular initial state $\vert \psi(t_0) \rangle$ can be evolved to the state $\vert \psi(t) \rangle$ at a time $t > t_0$ at significantly lower computational cost than that of a full time evolution matrix calculation.

Using the method described above, we can calculate the incremental time evolution operators $\hat U(t_k +\Delta t,  t_k)$ to any desired order. The initial state can then be evolved by successively applying the incremental evolution operators to it. In terms of linear algebra, this is simply the problem of calculating: $U_1 x$, $U_2 U_1 x$, $U_3 U_2 U_1 x$, ..., $U_T U_{T - 1} \cdots U_1 x$ where $x$ is an $N\times 1$ column vector and each $U_i$ is an $N \times N$ unitary matrix.

Rather than using serial/parallel scan operations to first compute all of the matrix products and then multiply by the initial state vector to get the final result, a method that scales as $\mathcal{O}(N^3)$, the calculation can be performed using exclusively matrix-vector multiplication operations that scale as $\mathcal{O}(N^2)$. That is, the calculation starts by computing $U_1 x$. This result can then be used to calculate $U_2 U_1 x = U_2 (U_1 x)$, and so on.
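
In serial form, this successive matrix-vector approach is just the following loop (illustrative types and names):

\begin{verbatim}
// Serial sketch of single-state evolution by successive matrix-vector
// products: each step costs O(N^2) rather than the O(N^3) of a
// matrix-matrix multiply.
#include <vector>
#include <complex>

typedef std::vector<std::complex<double> > CVec;   // length N, or flattened NxN

CVec evolveState(const std::vector<CVec>& U, CVec x, int N) {
    for (size_t i = 0; i < U.size(); ++i) {         // apply U_1, U_2, ..., U_T
        CVec y(N, 0.0);
        for (int r = 0; r < N; ++r)
            for (int c = 0; c < N; ++c)
                y[r] += U[i][r * N + c] * x[c];
        x = y;                                       // x now holds U_i ... U_1 x
    }
    return x;
}
\end{verbatim}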

Because of its $N$-fold better scaling properties, we used a data-distributed version of this latter technique of successive matrix-vector multiplication to parallelize the calculation of single state evolution. Let $p$ be the number of processing units and $N$ be the dimension of the Hilbert space, and let $T$ be the number of matrices in the operation, i.e. the length of the simulation. The algorithm, sketched in code after the list, is quite simple:

  1. Each processor stores approximately $N/p$ rows of each of the $T$ matrices $U_i$. Let $U_i^k$ be the ``local'' matrix stored by processor $k$. That is, $U_i^k$ contains $N/p$ rows of $U_i$.
  2. Let the vector $curRes = x$, where $x$ is the initial state of the system.
  3. for $i = 1$ to $T$:
    1. Each processor $k$ calculates $localRes = U_i^k \times curRes$
    2. Each processor broadcasts $localRes$ to all others. The next state vector is reconstructed from each processor's $localRes$ vector and stored in $curRes$
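
A minimal sketch of this row-distributed scheme follows, with hypothetical names and MPI_Allgatherv standing in for the broadcast/reconstruction step; complex entries are transmitted as pairs of doubles, and row bands are assumed to be assigned in rank order.

\begin{verbatim}
// Sketch of row-distributed single-state evolution.  Each processor owns a
// contiguous band of myRows rows of every U_i, flattened row-major; after
// each local matrix-vector product the full state vector is reassembled on
// every processor.
#include <mpi.h>
#include <vector>
#include <complex>

typedef std::vector<std::complex<double> > CVec;

// curRes enters as the initial state x (length N) and returns as
// U_T ... U_1 x on every processor.
void evolveDistributed(const std::vector<CVec>& localU,
                       CVec& curRes, int N, int myRows, MPI_Comm comm) {
    int p;
    MPI_Comm_size(comm, &p);

    // Row counts and displacements in units of doubles (two per entry).
    std::vector<int> rows(p), counts(p), displs(p);
    MPI_Allgather(&myRows, 1, MPI_INT, &rows[0], 1, MPI_INT, comm);
    int off = 0;
    for (int k = 0; k < p; ++k) {
        counts[k] = 2 * rows[k];
        displs[k] = 2 * off;
        off += rows[k];
    }

    CVec localRes(myRows), next(N);
    for (size_t i = 0; i < localU.size(); ++i) {
        // Local band of rows of U_i times the current full state vector.
        for (int r = 0; r < myRows; ++r) {
            localRes[r] = 0.0;
            for (int c = 0; c < N; ++c)
                localRes[r] += localU[i][r * N + c] * curRes[c];
        }
        // Reassemble the full next state vector on every processor.
        MPI_Allgatherv(&localRes[0], 2 * myRows, MPI_DOUBLE,
                       &next[0], &counts[0], &displs[0], MPI_DOUBLE, comm);
        curRes = next;
    }
}
\end{verbatim}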

Runtime Analysis of Row Distributed Matrix-Vector Multiplication

The running time associated with this algorithm will be:

\begin{displaymath}
\mathcal{O} \left( \frac{N^2}{p} + C(N,p) \right) T
\end{displaymath} (3)

where $C(N,p)$ is the time associated with sharing the local result $localRes$ with all other processors. None of the MPI documentation we found gives a good estimate of the running time of MPI_Allgather. Nonetheless, a safe upper bound is $C(N,p) = \mathcal{O}(p (p-1) N/p) = \mathcal{O}(Np)$: in the worst case, each processor sends its $\mathcal{O}(N/p)$ data elements to each of the $(p - 1)$ other processes.

Thus, if $N$ is large, then $C(N,p)$ is dominated by the matrix vector multiplications. In this case, the running time is close to:

\begin{displaymath}
\mathcal{O} \left ( \frac{N^2T}{p} \right)
\end{displaymath} (4)

This result is highly desirable because it represents a $p$-fold speedup over the ``ideal'' serial algorithm based on matrix-vector multiplication.

Future Work

One area where PSiQCoPATH could be improved significantly is memory management during parallel prefix runs. In its current form, the program first calculates and stores the incremental time evolution operators for all time steps. Each processor then performs a serial scan operation on its own block of data. The storage of all the incremental time evolution operators accounts for the vast majority of PSiQCoPATH's memory usage, and this memory usage, in turn, is what limits the size of problem we are able to run.

We recently realized that it is never necessary to have all of the incremental time evolution operators stored at the same time. In general, the number of output time steps requested by the user is much smaller than the number of time steps actually performed in the evolution (by a factor of perhaps 1000). Rather than storing all of the incremental operators, all we really need is the much smaller set corresponding to the coarser output time step.

A much more efficient procedure would be to partially combine the first two steps in the following way. Let $\mathcal{N}$ be the ratio of the total number of time steps to the number of output steps requested. Instead of storing every $\hat{U}(t_k + \Delta t,\, t_k)$, we really only need to store each

\begin{displaymath}
\hat{U}_{\mathcal{N}}(t_k + \mathcal{N}\Delta t,\, t_k) = \hat{U}(t_k + \mathcal{N}\Delta t,\, t_k + (\mathcal{N} - 1)\Delta t)\cdots\hat{U}(t_k + \Delta t,\, t_k).
\end{displaymath}

We can build up $\hat{U}_{\mathcal{N}}$ by multiplying in each successive incremental time evolution operator as it is generated. Once $\mathcal{N}$ time steps have been calculated and combined, that $\hat{U}_{\mathcal{N}}$ can be stored in memory and the next one started. In this way, the memory requirements of the program are cut by roughly a factor of $\mathcal{N}$. As an additional benefit, the final local update step is also shortened by a factor of $\mathcal{N}$. Overall, this leads to a speedup of approximately $2/(1 + \mathcal{N}^{-1})$ in the projected running time.
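
A sketch of this refinement follows, with makeU standing in for whatever routine generates the incremental operators on demand; it is a placeholder, not an actual PSiQCoPATH function.

\begin{verbatim}
// Accumulate each group of NN consecutive incremental operators into a
// single coarse operator as they are generated, so that only one matrix
// per output step is ever stored.
#include <vector>
#include <complex>

typedef std::vector<std::complex<double> > Mat;   // flattened N x N, row-major

static Mat matmul(const Mat& A, const Mat& B, int N) {
    Mat C(N * N, 0.0);
    for (int r = 0; r < N; ++r)
        for (int k = 0; k < N; ++k)
            for (int c = 0; c < N; ++c)
                C[r * N + c] += A[r * N + k] * B[k * N + c];
    return C;
}

// makeU(k) generates the incremental operator U(t_{k+1}, t_k) on demand;
// T is the number of steps in this block and NN the steps-per-output ratio.
std::vector<Mat> coarseOperators(Mat (*makeU)(int), int T, int NN, int N) {
    Mat acc(N * N, 0.0);
    for (int r = 0; r < N; ++r) acc[r * N + r] = 1.0;   // start from identity
    std::vector<Mat> stored;
    for (int k = 0; k < T; ++k) {
        acc = matmul(makeU(k), acc, N);      // later time goes on the left
        if ((k + 1) % NN == 0) {             // one coarse output step finished
            stored.push_back(acc);           // keep only the combined product
            for (size_t j = 0; j < acc.size(); ++j) acc[j] = 0.0;
            for (int r = 0; r < N; ++r) acc[r * N + r] = 1.0;  // reset
        }
    }
    return stored;
}
\end{verbatim}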

We also did not fully explore alternative basic arithmetic algorithms that could speed up the code. For instance, Strassen's algorithm for matrix multiplication runs asymptotically faster than the standard $\mathcal{O}(N^3)$ method, but it has significantly different memory usage and a much larger overhead than the standard matrix multiply. Thus, simply switching to Strassen's algorithm would not necessarily be an improvement. Possible changes of this sort are worth considering, and can easily be integrated and tested in our code thanks to its object-based construction.


Case Study on Object Oriented Parallel Programming

Throughout the course on parallel computing, most of the skeletal code we were given was not object-oriented. We made it a point to make our code as object-oriented as possible. Over the course of the project we found that many aspects of object-oriented programming carry over directly to the parallel setting, but we also encountered some new challenges unique to parallel programming.

At a high level, the big advantage of object-oriented programming is the power of abstraction. We employed such abstraction with objects that we knew would be parallelized. For instance, in the ComplexMatrix class, we have methods such as send, receive, and rowDist to communicate matrix objects between processors.

The send and receive methods were relatively simple and worked well. These routines simply send entire matrices to/from other processors via MPI. An alternative approach would have been to define a new MPI datatype for objects of the ComplexMatrix class, but we found it much simpler to add these communication methods to the class itself.

The rowDist method was somewhat awkward. Its goal was to distribute the data of a matrix stored completely on a single processor to all other processors in row-wise fashion. Similar to an MPI_Bcast command, every processor calls rowDist. Before calling this command, however, each processor had to determine how many rows it would store. This added an additional step to the method, but was not an insurmountable challenge.
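
A rough sketch of how such methods can be layered on MPI follows; the class shown is illustrative only, and its interface does not match the actual ComplexMatrix. It assumes every processor constructs a full-size matrix object and determines its own row count before the collective rowDist call, with row bands assigned in rank order.

\begin{verbatim}
// Illustrative matrix class with MPI communication methods built in;
// complex entries are transmitted as pairs of doubles.
#include <mpi.h>
#include <vector>
#include <complex>

class ComplexMatrixSketch {
public:
    ComplexMatrixSketch(int rows, int cols)
        : nRows(rows), nCols(cols), data(rows * cols) {}

    // Point-to-point transfer of the whole matrix.
    void send(int dest, int tag, MPI_Comm comm) {
        MPI_Send(&data[0], 2 * nRows * nCols, MPI_DOUBLE, dest, tag, comm);
    }
    void receive(int src, int tag, MPI_Comm comm) {
        MPI_Recv(&data[0], 2 * nRows * nCols, MPI_DOUBLE, src, tag, comm,
                 MPI_STATUS_IGNORE);
    }

    // Distribute a matrix held entirely on `root` across processors in
    // contiguous row bands; called collectively, after each processor has
    // already determined how many rows (myRows) it will own.
    void rowDist(int root, int myRows, MPI_Comm comm) {
        int rank, p;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &p);
        std::vector<int> counts(p), displs(p);
        MPI_Gather(&myRows, 1, MPI_INT, &counts[0], 1, MPI_INT, root, comm);
        if (rank == root) {
            int off = 0;
            for (int k = 0; k < p; ++k) {
                counts[k] *= 2 * nCols;      // rows -> doubles
                displs[k] = off;
                off += counts[k];
            }
        }
        std::vector<std::complex<double> > local(myRows * nCols);
        MPI_Scatterv(&data[0], &counts[0], &displs[0], MPI_DOUBLE,
                     &local[0], 2 * myRows * nCols, MPI_DOUBLE, root, comm);
        data.swap(local);                    // keep only the local rows
        nRows = myRows;
    }

private:
    int nRows, nCols;
    std::vector<std::complex<double> > data;
};
\end{verbatim}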

Overall, we found object-oriented techniques very useful for maintaining abstractions in an MPI-driven parallel program. However, classes should be designed with parallelism in mind to achieve maximum robustness. It is not always easy to sprinkle MPI routines into a class after it has already been designed.