This is the progress report for Andrew Menard's class project for 18.337, Applied Parallel Computing. The timeline as listed in the project proposal:
Investigate parallel programming tools to determine usefulness to the project - 1 wk
Run timing analysis of current algorithm to verify hot spots; analyze code to check for sequential dependencies in and around them - 1 wk
Write code to parallelize operation - 1.5 wk
Test code, run timing analysis to determine benefits from parallel code - 3.5 wk
Write report - 1 wk

The first point is done. I initially considered using MPI or OpenMP, since we have covered them in class and I now have some experience with MPI from the problem sets, but eventually settled on a system developed by my department. My reasoning was that ideally the parallelism should be transparent to the user (aside from the speedup), both for other people who might be compiling my code and for end users. This makes a system such as the one used on the class Beowulf cluster suboptimal: requiring a user to take extra steps such as running lamboot and specifying at runtime how many processors to use is not desirable, and requiring a switch to mpicc or some equivalent compiler is also not desirable (switching compilers on a large project is a nightmare). Additionally, the code as written has several large data structures that all processes will need access to, and sending them through a message-passing interface would be unnecessarily clumsy and time-consuming. Fortunately, my department already has an internally developed library that implements a shared-memory system, providing an API for forking extra processes and for mutexes between processes to control access to shared resources. On the downside, picking this toolset required me to learn a new set of procedures, but having it already installed on all of the relevant machines and operating transparently was too great an incentive to pass up. I spent some time learning and experimenting with this system before applying it to the real software.

I then moved on to the second part, analysis of the algorithm used. There is an initial setup step, a constant overhead covering reading the design problem in from disk, setting up a graphics window, and the like; it is probably not usefully parallelizable. Since it typically consumes a small fraction of the runtime and does not grow much with problem size, it is not worth parallelizing. The initial heuristics used to get a rough solution are an O(n log n) process and, judging from run logs, typically consume under 10% of the runtime, less on large problems. While there is some potential for parallelism here, the fact that this section's share of the runtime shrinks as the problem size grows indicates that it is probably not worthwhile to spend much time optimizing it. The refinement optimizer, on the other hand, consumes more than 50% of the runtime on small problems and as much as 90% on the largest ones. The current algorithm is O(n^2): it takes the collection of nets to be optimized, then checks one pair of nets at a time to see whether any pairwise improvement is possible, repeating for all pairs. Since its order of growth is worse than that of the other algorithms used, it will dominate the runtime in the long run and should be the focus of optimization efforts.
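To make the structure of this pass concrete, here is a minimal sketch of the serial pairwise refinement loop as described above. The names (net_t, n_pins, swap_pins, constraints_ok, score) are hypothetical stand-ins for the real code's identifiers, not the actual API:

    /* Sketch of the O(n^2) refinement pass; all identifiers are
     * illustrative stand-ins, not the project's real names. */
    typedef struct net net_t;                  /* opaque net type (assumed) */

    extern int    n_pins(const net_t *net);
    extern void   swap_pins(net_t *a, int i, net_t *b, int j);
    extern int    constraints_ok(const net_t *a, const net_t *b);
    extern double score(const net_t *a, const net_t *b);

    /* Try every single-pin exchange between a and b, keeping legal
     * improvements and undoing everything else. */
    static void optimize_pair(net_t *a, net_t *b)
    {
        double best = score(a, b);
        for (int i = 0; i < n_pins(a); i++)
            for (int j = 0; j < n_pins(b); j++) {
                swap_pins(a, i, b, j);
                if (constraints_ok(a, b) && score(a, b) > best)
                    best = score(a, b);        /* keep the improving swap */
                else
                    swap_pins(a, i, b, j);     /* undo illegal or worse swap */
            }
    }

    /* One pass over all unordered pairs: n*(n-1)/2 calls, hence O(n^2). */
    static void refine_pass(net_t *nets[], int n)
    {
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++)
                optimize_pair(nets[i], nets[j]);
    }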
Having verified through a number of test cases that this is the most time-consuming portion of the code, and that it will continue to grow worse as problem sizes increase, I moved on to a detailed analysis of what it does and how it could be parallelized. The refinement optimizer takes as input a collection of nets, a collection of constraints on nets, and several parameters to a scoring function. The nets are assumed to start out valid under all of the constraints, and the scoring function is assumed to be reasonably well behaved. The collection of nets is hopefully close to a global maximum of the scoring function; this optimizer makes small improvements that carry the collection toward the nearest local optimum, which is hopefully the global one. Pairs of nets are chosen and examined: pins are swapped between the two nets, the constraints are checked after each swap to see whether the nets are still legal, and the score is recalculated to see whether it has improved. After the best solution is found for this pair of nets, another pair is chosen. After all pairs of nets are done, some minor work is done, and this entire process is repeated several times.

The most obvious way to parallelize this is to send pairs of nets off to separate processes; pin swapping between nets A and B can happen without interfering with another process looking at nets C and D. There must be a mutex between pairs that overlap: if one process is swapping pins between A and B while another is working on B and C, net B could be changed by both processes at once. Order, however, does not matter; either pair could be done first.

The code changes required to parallelize this algorithm divide into several parts. First, some sort of scheduler needs to be written. At present the code works on A and B, then A and C, then A and D, and so on, which is clearly a bad order in which to dispatch pairs to distinct processes, since a mutex on net A would stall all but the first process and force the computation to be nearly serial. Ideally the scheduler should cope with processes completing at different speeds; while the processor speeds will usually be identical, particular combinations of nets may be much easier or harder to deal with. Synchronizing after every pair of nets would make the scheduler much simpler, at the expense of potentially stalling most processes while they wait for a single slow one each round. A pair can only be inserted into the schedule when neither net in the pair is awaiting optimization or being optimized; the scheduler must delay such pairs until whatever process is working on them has finished. The scheduler must also avoid painting itself into a corner: if it repeatedly schedules a subset of the nets, it could reach a point near the end where serial optimization is forced upon it. For example, if for some reason it schedules no pair including net Z until all pairs not including net Z are done, the computation ends sequentially as processes optimize Z and A, then Z and B, and so on. Finally, writes to global variables need to be eliminated; things like a global counter indicating which pair of nets is being worked on are dangerous, since each process may increment it and corrupt the others' view of it, so such code needs to be made safe. A sketch of the kind of scheduler I have in mind follows.
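The sketch below illustrates the mutual-exclusion rule using POSIX process-shared mutexes in anonymous shared memory as a stand-in for the department library's primitives; pair_t, net_busy, take_pair, and finish_pair are all hypothetical names:

    #include <pthread.h>
    #include <stdbool.h>
    #include <sys/mman.h>

    typedef struct { int a, b; bool done; } pair_t;

    /* Scheduler state; in the real code this would live in the segment
     * provided by the department's shared-memory library. */
    static pthread_mutex_t *sched_lock;  /* process-shared mutex */
    static pair_t          *pairs;       /* all n*(n-1)/2 candidate pairs */
    static bool            *net_busy;    /* net_busy[i]: net i is claimed */
    static int              n_pairs;

    /* Map the scheduler state so fork()ed processes share it. */
    static void sched_init(int n_nets, int npairs)
    {
        size_t sz = sizeof(pthread_mutex_t)
                  + npairs * sizeof(pair_t) + n_nets * sizeof(bool);
        char *mem = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        sched_lock = (pthread_mutex_t *)mem;
        pairs      = (pair_t *)(mem + sizeof(pthread_mutex_t));
        net_busy   = (bool *)(mem + sizeof(pthread_mutex_t)
                                  + npairs * sizeof(pair_t));
        n_pairs    = npairs;

        pthread_mutexattr_t at;
        pthread_mutexattr_init(&at);
        pthread_mutexattr_setpshared(&at, PTHREAD_PROCESS_SHARED);
        pthread_mutex_init(sched_lock, &at);
    }

    /* Claim the first pending pair whose nets are both free; return its
     * index, or -1 if every runnable pair currently conflicts. */
    int take_pair(void)
    {
        int got = -1;
        pthread_mutex_lock(sched_lock);
        for (int k = 0; k < n_pairs; k++)
            if (!pairs[k].done && !net_busy[pairs[k].a]
                               && !net_busy[pairs[k].b]) {
                net_busy[pairs[k].a] = net_busy[pairs[k].b] = true;
                got = k;
                break;
            }
        pthread_mutex_unlock(sched_lock);
        return got;
    }

    /* Mark a pair finished and release its nets for other processes. */
    void finish_pair(int k)
    {
        pthread_mutex_lock(sched_lock);
        pairs[k].done = true;
        net_busy[pairs[k].a] = net_busy[pairs[k].b] = false;
        pthread_mutex_unlock(sched_lock);
    }

This naive first-fit scan captures the overlap rule, but as noted above a real scheduler would also have to order the pairs carefully to avoid the serial end-game where only pairs sharing one net remain.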
Code must be written to create the extra processes at the start of optimization and shut them down at the end. A process other than the master must detect that it is not the master, wait for the scheduler to flag a pair of nets as available, optimize them, and flag them as finished. The master process must detect that it is the master and begin scheduling pairs of nets for optimization. Code is also needed for the non-master processes to acquire all the data they need, set the mutexes so that they hold locks on all of that data, gracefully handle any locking conflicts that come up, and ensure that all locks are released once optimization of the pair is finished. After the scheduler reports that no more net pairs are available, each process needs to clean itself up and exit correctly, with no memory leaks, held locks, or other loose ends.

With these ideas of how to parallelize the code, I have started writing and testing it. So far I have concentrated on the scheduler, which I believe will be the hardest part to get working well. It manages to create the extra processes and produces an ordering of net pairs, but it does not yet arrange the order well enough to keep the processes from sometimes serializing while they wait for locks. I anticipate spending the next couple of weeks debugging this and the rest of the code before moving on to serious performance testing.
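As an illustration of the master/worker structure described above, here is a minimal sketch using plain fork() and wait() in place of the department library's process API; take_pair and finish_pair are the scheduler calls sketched earlier, and all_pairs_done and optimize_scheduled_pair are hypothetical helpers:

    #include <sys/wait.h>
    #include <unistd.h>

    extern int  take_pair(void);            /* from the scheduler sketch */
    extern void finish_pair(int k);
    extern int  all_pairs_done(void);       /* true once every pair is done */
    extern void optimize_scheduled_pair(int k);  /* pin swapping on pair k */

    /* Every process, master included, simply drains the schedule. */
    static void worker_loop(void)
    {
        while (!all_pairs_done()) {
            int k = take_pair();
            if (k < 0)
                continue;   /* all pending pairs conflict; a real version
                               would sleep or block instead of spinning */
            optimize_scheduled_pair(k);   /* both nets are locked for us */
            finish_pair(k);               /* always release the locks */
        }
    }

    /* Fork workers, participate in the work, then reap every child so
     * no processes, locks, or memory are leaked at shutdown. */
    void parallel_refine(int n_extra_procs)
    {
        for (int w = 0; w < n_extra_procs; w++)
            if (fork() == 0) {            /* child becomes a worker */
                worker_loop();
                _exit(0);
            }
        worker_loop();                    /* master works too */
        while (wait(NULL) > 0)
            ;                             /* collect all children */
    }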