MIT 18.337/6.338

The computation of the SVD is dominated by the `8/3*n^3` floating point operations required by the Householder reduction of the matrix to bidiagonal form (for simplicity, we assume that the matrix is square). The actual count includes lower order terms, but they are typically ignored. Assuming (wrongly) that the algorithm can be perfectly parallelized, it would be possible to reduce the number of operations for each process to `8/3*n^3/p` for p processes.
Parallel algorithms have to communicate between processes, but the communication complexity is roughly proportional to the size of the matrix, i.e. of the order `n^2`, and will therefore be dominated by the floating point operations for sufficiently large matrices. However, floating point operations are much faster than memory operations, which are in turn faster than network operations. Hence, in most practical applications of parallel linear algebra, the lower order terms can't be ignored, and the communication costs often influence the design and choice of parallel algorithm: algorithms with a higher number of floating point operations end up being preferred because they require less communication between the processes.

The communication costs associated with parallel algorithms mean that there is a crossover point at which the parallel algorithm becomes faster than the sequential one. We can approximate this crossover point by using the cost of a floating point operation as the unit in which we measure the run time of an algorithm. In these units, the run time of the sequential algorithm is simply `8/3*n^3`, whereas the parallel algorithm runs in `8/3*n^3/p + C*n^2` time, where C is the relative cost of communication compared to floating point operations. Notice that C also captures details specific to the algorithm: it will be higher for algorithms that require more communication. E.g. for a parallel LU algorithm, C will be much smaller than the C associated with a parallel algorithm for reduction to bidiagonal form. Hence, it is expected that the crossover point for the LU is lower than it is for the SVD. From the two cost expressions, it is also clear that if communication (memory or network operations) becomes slower relative to the processor, then the crossover point will become larger. Indeed, processor speeds have developed much more rapidly than memory and network speeds, so over time the crossover point has grown.
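Setting the two cost expressions equal, `8/3*n^3 = 8/3*n^3/p + C*n^2`, and solving for n gives the crossover size `n* = 3*C*p/(8*(p-1))`. The following sketch (in Python for convenience; the value of C below is an illustrative placeholder, not a measurement) checks this simple model:

```python
# Cost model in units of one floating point operation:
#   sequential: 8/3 * n^3
#   parallel:   8/3 * n^3 / p + C * n^2
# Equating the two and solving for n gives n* = 3*C*p / (8*(p-1)).

def crossover(C, p):
    """Matrix size at which the parallel algorithm starts to win."""
    assert p > 1, "a crossover only exists with at least two processes"
    return 3 * C * p / (8 * (p - 1))

def seq_cost(n):
    return 8 / 3 * n**3

def par_cost(n, C, p):
    return 8 / 3 * n**3 / p + C * n**2

if __name__ == "__main__":
    C, p = 1000.0, 4          # C is a made-up communication cost for illustration
    n_star = crossover(C, p)
    print(f"crossover at n = {n_star:.0f}")
    # Below the crossover the sequential algorithm is cheaper ...
    print(seq_cost(250) < par_cost(250, C, p))    # True
    # ... and above it the parallel algorithm is cheaper.
    print(seq_cost(1000) > par_cost(1000, C, p))  # True
```

Note how increasing C (slower communication relative to flops) pushes the crossover to larger matrices, matching the discussion above.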
Distributed linear algebra libraries such as Elemental and ScaLAPACK have provided implementations of parallel SVD solvers for a long time. These should allow for faster computation of the SVD on systems with more than a single core. Since almost all computers now have more than a single computational core, it should be possible to test the speedup even on a laptop.

In the following, we will use Julia to measure the parallel implementations of LU and SVD in Elemental and compare them to the sequential implementations in LAPACK. Before measuring the different implementations, write down your expectations for the scaling and revisit them after the runs. Try to explain the results. You can create the timings on any system of your choice: a laptop, a big shared memory machine, a cluster, or cloud instances. Please specify all relevant details of the system and discuss how the system might influence the results. Make sure to disable multithreading with Julia's `BLAS.set_num_threads(1)` (remember to do `using LinearAlgebra` first) to avoid any interference from multithreaded BLAS operations.

The parallel implementations are from the Elemental library. It is implemented in C++ but has a C API as well as Julia and Python wrappers. You can install the Julia wrappers by adding the Elemental package. It will automatically download and build Elemental on your machine, so it will take a while to finish. If you prefer to use C, C++, or Python, you can download the sources from http://libelemental.org and build the library on your own. The Julia wrappers are currently sparsely documented but follow the C API closely, so it should be possible to use the wrappers based on the official documentation for the C API together with reading the source of the Julia wrappers at https://github.com/JuliaParallel/Elemental.jl.

The Elemental library is based on MPI, so a working MPI installation is required on the system where you are creating the timings. If you use your Mac for the timings, notice that only OpenMPI is supported. To test on multiple processes, you'd have to launch your script with `mpirun`. E.g.

```
mpirun -np 4 julia time_lu.jl
```

In your script, initialize and finalize Elemental around your own code:

```
using Elemental

Elemental.Init()

# ...your program...

Elemental.Finalize()
```

You can create a distributed Elemental matrix (DistMatrix) with Float64 elements with

`A = Elemental.DistMatrix(Float64)`

Notice that Elemental matrices are not initialized with a size. Instead, the size is passed when filling the array, e.g. with Gaussian random variables as in

```
Elemental.gaussian(A, 100, 100)
```

`LAPACK.gebrd!` reduces a matrix to bidiagonal form (remember to load Julia's LinearAlgebra module first). Use random matrices to time `gebrd!`. Compare to `svdvals` and identify when the bidiagonal reduction dominates the computation. Plot the relative timings as a function of the matrix size.
Time the LU factorization of an Elemental `DistMatrix` for square matrices, doubling the size of the matrix until you run out of memory. Notice that Julia's `lu` function will automatically dispatch to Elemental's LU solver when called on an `Elemental.DistMatrix`. Make an informative plot of the timings.
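The sequential baseline for such a doubling experiment could be sketched as follows (Python with SciPy's `lu_factor` standing in for LAPACK's LU; the starting size and number of doublings are placeholders):

```python
import time
import numpy as np
from scipy.linalg import lu_factor  # LAPACK getrf: LU with partial pivoting

rng = np.random.default_rng(0)
sizes, timings = [], []

n = 256                    # placeholder starting size
for _ in range(4):         # keep doubling until memory (or patience) runs out
    A = rng.standard_normal((n, n))
    t0 = time.perf_counter()
    lu_factor(A)           # returns (lu, piv); A itself is left untouched
    timings.append(time.perf_counter() - t0)
    sizes.append(n)
    n *= 2

for n, t in zip(sizes, timings):
    print(f"n={n:5d}: {t:.3f}s")
```

Since each doubling multiplies the `n^3` flop count by eight, consecutive timings should grow roughly eightfold once the runs are long enough for the cubic term to dominate.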
Similarly, time the `svdvals` function on an `Elemental.DistMatrix`. Make an informative plot of the timings.
Be sure to list the machine you are using, and try sizes n, 2n, 4n, 8n, pushing the limits of the memory of your machine. What do you observe? If possible, try to get hold of the biggest machine you can get your hands on.
This problem is researchy; maybe there is no good answer. I was hoping that one might use the GSVD to morph one photo into another, or at least that at one end of the interpolation one photo looks good and at the other end the other photo looks good.

When I tried it, the clipping of pixel values from 0 to 1 just seemed to ruin the whole thing. What do you think is a fix for this, if any? The best solutions could make a paper.

Refer to this notebook for more details.

**Deadline for submitting both parts is Friday, 30th November at 11:59 PM.**