Project Proposal
18.337 - Applied Parallel Computing
Spring 2002
Andrew Menard
armenard@mit.edu
Abstract:
For the class project, I propose to take a program currently implemented for single-processor machines and modify it to run on parallel machines. The program to be modified is ClockDesigner, which optimizes clock distribution networks on microchips. The current sequential algorithm, implemented in C++, suffers from long runtimes (hours to days for reasonably sized problems) but exhibits some apparent parallelism, so it should be possible, though non-trivial, to modify it to benefit from running on clusters of 2-16 processors.
Background:
This project arises from my work at IBM, where I design and implement algorithms for physical design and layout of microchips. My recent area of interest has been clock signal distribution: there is a single clock signal generator somewhere on the chip, and its signal must be distributed to all of the latches on the chip so that it reaches them all simultaneously (within some small tolerance; the smaller the tolerance, the faster the chip can run). This must be accomplished while minimizing wire usage, the number of transistors used for repeaters, and total delay in the system, and while avoiding noise on signal wires and the use of wire in congested areas of the chip. It is therefore a difficult problem, and a significant amount of effort has gone into implementing an efficient set of heuristics for it. Unfortunately, increasing problem sizes and the addition of new constraints are producing unacceptably long runtimes.
Current approach:
The current approach arranges the clock network as a tree and optimizes one level of the tree at a time; different branches at the same level are mostly independent of each other. After all of the branches at the leaf level are optimized, the algorithm moves up to the next level, repeating until it reaches the root. The nets at each level have several heuristics applied to them sequentially; the first produces a rough approximation of the optimal arrangement, while the later ones make small adjustments to find the local minimum (which is hopefully the global minimum if the initial heuristic was correct). After the final optimization of the nets at one level, the repeaters driving those nets are placed and then used as the input to the next level as it is optimized.
Parallelizability:
At each level, if there are two distinct sets of nets, they can be optimized separately. Moreover, several of the individual optimizations to a set of nets are done by trying several values of a parameter; these trials could run in parallel, with all but the best result discarded. The most obvious potential problem is the volume of data: the overhead of communicating all of the relevant information to a separate process and then returning the results to the master process might overwhelm the benefit of spreading out the computation. A typical run of the current algorithm might spend 30 minutes reading data from disk and 90 minutes optimizing, which suggests that mirroring a significant fraction of the data to a large number of processors could itself be time consuming. The second potential problem is that the parallel version of the algorithm needs to be transparent to the user and multiplatform; the code currently runs on AIX, Solaris, HP-UX, and Linux, and a parallel version would ideally retain this cross-platform ability without requiring the end user to configure anything.
Schedule:
Investigate parallel programming tools to determine their usefulness to the project - 1 wk
Run timing analysis of the current algorithm to verify hot spots; analyze the code for sequential dependencies in and around them - 1 wk
Write code to parallelize operation - 1.5 wk
Test code; run timing analysis to determine the benefit from the parallel code - 3.5 wk
Write report - 1 wk