Homework 1: MPI and MATLAB*P

The purpose of this assignment is to get you started with parallel computing.

MIT students are welcome to use our Beowulf cluster or any other cluster at their disposal.

Take a look at the Beowulf instructions for an introduction to using our class Beowulf cluster.

Use of the PBS system is mandatory.

Outline

You are reminded to start on the homework early. Late homework will not be accepted, no matter what your reason is.

Instructions for submission:

For our sanity's sake, please follow the submission procedure:

Create a directory called 'your user name'-hw1
For example, cly-hw1
Create four subdirectories, prob1, prob2, prob3, prob4
Put your files (source code, text files) in the appropriate directory
tar your directory by: tar cf 'your user name'-hw1.tar 'your user name'-hw1
For example, tar cf cly-hw1.tar cly-hw1
gzip it:
gzip cly-hw1.tar
Put this file in the root of your home directory,
cp cly-hw1.tar.gz ~
The file will be collected on 2/13 at 11:59pm EST. This corresponds to 2/14 4:59am GMT, which is the time zone the server uses.

Part I: Parallel substring search
Part II: Merge Sort
Part III: Power method
Part IV: How fast are those clusters

Part I: Parallel substring search

In lecture we talked about the "basic six" MPI routines:

MPI_Init
MPI_Finalize
MPI_Comm_size
MPI_Comm_rank
MPI_Recv
MPI_Send

(refer to the list of MPI routines for definitions)

This problem is an opportunity to try out these routines and to learn about some more advanced ones.

Consider the following problem:

Let S be a random string of the alphabet Q = { A, C, G, T } with length mn. S is distributed among m processors such that processor i has a substring S_i of length n. S equals concat(S_0, S_1, ..., S_(m-1)).

Let I be an input string of the alphabet Q.

Write a program which, when run on m processors (-np m) does the following:

Ask user for input n.
Generate S of length mn by having each process generate the local random substring S_i.
Ask user for input I.
Locate all occurrences of I in the string S. Then collect and print the following statistics:
- The minimum 'gap' (in characters) between two occurrences
- The average 'gap' between two occurrences
- The maximum 'gap' between two occurrences
Loop back and ask user for input I. Repeat.
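The search step, including matches that cross a boundary between two processes, can be sketched serially as follows. This is only an illustration in plain Python, not MPI code: a list of strings stands in for the m processes, and the names make_chunks, local_matches and find_all are hypothetical. It assumes I has length at most n+1, so the first len(I)-1 characters of the next chunk are enough to catch every straddling match.

```python
import random

ALPHABET = "ACGT"

def make_chunks(m, n, seed=0):
    """Simulate m processes, each generating its own random substring S_i of length n."""
    chunks = []
    for rank in range(m):
        rng = random.Random(seed + rank)  # per-rank seed so the chunks differ
        chunks.append("".join(rng.choice(ALPHABET) for _ in range(n)))
    return chunks

def local_matches(chunks, rank, pattern):
    """Global start positions of matches that begin inside chunk `rank`."""
    n = len(chunks[rank])
    halo = ""
    if rank + 1 < len(chunks):
        # Borrow just enough of the next chunk to complete a straddling match.
        halo = chunks[rank + 1][:len(pattern) - 1]
    text = chunks[rank] + halo
    hits = []
    pos = text.find(pattern)
    while pos != -1 and pos < n:
        hits.append(rank * n + pos)
        pos = text.find(pattern, pos + 1)  # step by 1 so overlapping matches count
    return hits

def find_all(chunks, pattern):
    """Concatenate the per-rank results in rank order."""
    result = []
    for rank in range(len(chunks)):
        result.extend(local_matches(chunks, rank, pattern))
    return result

# "ACGTACGTACGT" split over 3 simulated processes; both matches straddle a boundary.
print(find_all(["ACGT", "ACGT", "ACGT"], "TACG"))  # [3, 7]
```

In an MPI program the borrowed characters would have to be sent by the neighbouring process, and the per-rank lists would not be gathered on one node.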

What should you submit

Source code
Executable, or Makefile that builds the executable

Hints/Clarifications:

You need to consider those substrings that 'straddle' between S_i and S_(i+1)
MPI_Reduce would be useful at some point
Q: In reasonably large data sets, the chance of having 0 instances of a letter on any 1 node would be extremely small, so is it required that the code be able to handle the 0 instance condition?
A: Well, the chance of having 0 instances also depends on the length of the pattern to be matched. But anyway, your code should handle the 0 instances case gracefully.
Q: Along the same lines, is the statistics calculation step required to be parallelized or can I bump the string up to the master node to calculate the statistics? I just want this clarified before I do more work than I need to.
A: The statistics calculation has to be parallelized.
Q: For the first problem of homework 1, it says the string I could straddle S_{i} and S_{i+1}. Is it possible that the string also spans more than 2 processes? E.g. I spans from S_{2} to S_{5}.
A: Ignore cases of I spanning more than 2 processes, i.e. assume I is of length at most n+1.
IMPORTANT Q: On a different note, how do you define the distance between two strings in the first problem? Is it the distance between the first character in each string? or the distance between the last character of one string and the first of the next one? (Which might lead to complications if two strings overlap)
A: Good question. As a proper course 6 student I'll define it as the distance between the first character in each string.

Q: By gaps between occurrences, do you mean between two adjacent occurrences, or between each occurrence and every other occurrence?

I.e. in the following example:

string:
aTCTCTagTCT
substring:
TCT

Are the gaps 2 (between first and second), 5 (between second and third)
OR
2 (between 1 and 2), 7 (between 1 and 3), and 5 (between 2 and 3).

A: Between 2 adjacent occurrences. So 2, 5.
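The gap definition from the two answers above (distance between the first characters of adjacent occurrences) can be checked on the example string with a few lines of serial Python. The function name gap_stats is hypothetical, and this is only a reference for the expected numbers; the assignment still requires the statistics to be computed in parallel.

```python
def gap_stats(positions):
    """Return (min, average, max) gap between adjacent start positions,
    or None when there are fewer than two occurrences."""
    if len(positions) < 2:
        return None
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return min(gaps), sum(gaps) / len(gaps), max(gaps)

text, pattern = "aTCTCTagTCT", "TCT"
starts = [i for i in range(len(text)) if text.startswith(pattern, i)]
print(starts)             # [1, 3, 8]
print(gap_stats(starts))  # (2, 3.5, 5)
```

Returning None for zero or one occurrence is one way to handle the "0 instances" case gracefully.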

Part II: Merge Sort

Using MATLAB*P's mm mode, try to run a merge sort.

With np processors, generate a random vector of length 10000*np. Something like randn(10000*np*p,1) would work. Then, using mm mode, do a merge sort. The issue here is how to proceed beyond the initial sort step; this is for you to figure out. MATLAB*P <-> frontend traffic is allowed, but can only be done in blocks of size 1000.

Constraint - at any time you are not allowed to have more than 10000 elements of the input data on each node, no matter where they come from.
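One possible way to get past the initial local sort while respecting both limits is an odd-even "merge-split" exchange between neighbouring nodes. This is a sketch of a swapped-in technique, not necessarily the intended solution, and it is written in plain Python rather than MATLAB*P: each list stands in for the piece of the vector held by one node, and the names BLOCK, merge_split and odd_even_sort are hypothetical. It assumes all pieces have the same length. Each node's piece is updated in place, so it never grows, and only two blocks of at most BLOCK elements are on the "frontend" at a time.

```python
import random

BLOCK = 1000  # largest block moved through the frontend at once

def merge_split(lo, hi, block=BLOCK):
    """Leave the smaller half of lo+hi in lo and the larger half in hi.
    Both lists must be sorted and of equal length; they are updated in place."""
    n = len(lo)
    assert len(hi) == n
    for start in range(0, n, block):
        stop = min(start + block, n)
        a_blk = lo[start:stop]                # one block from the left node
        b_blk = hi[n - stop:n - start][::-1]  # mirrored block from the right node
        # Pair lo[i] with hi[n-1-i]; the minima are exactly the n smallest elements.
        lo[start:stop] = [min(x, y) for x, y in zip(a_blk, b_blk)]
        hi[n - stop:n - start] = [max(x, y) for x, y in zip(a_blk, b_blk)][::-1]
    lo.sort()  # local re-sort on each node
    hi.sort()

def odd_even_sort(nodes, block=BLOCK):
    """Sort equal-length pieces so that their concatenation is sorted."""
    for piece in nodes:
        piece.sort()  # the initial local sort step
    p = len(nodes)
    for phase in range(p):  # p alternating phases are enough
        for i in range(phase % 2, p - 1, 2):
            merge_split(nodes[i], nodes[i + 1], block)
    return nodes

rng = random.Random(0)
pieces = [[rng.gauss(0, 1) for _ in range(20)] for _ in range(4)]
odd_even_sort(pieces, block=8)
flat = [v for piece in pieces for v in piece]
print(flat == sorted(flat))  # True
```

The pairing trick avoids ever holding two full pieces in one place, at the cost of a local re-sort after every exchange.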

What should you submit

mergesort.m - entry point. Takes in a distributed vector and returns the sorted version.
Any associated m-files.

Hints/Clarifications

Q: "at any time you are not allowed to have more than 10000 elements of the random vector on any one MATLAB process". What exactly does this mean? Does the merge sort have to happen "in place", without using additional memory? For example, does a = mm('sort',b) violate this constraint? I guess the main problem here is that I don't know how exactly the mm mode works and what exactly the constraint means.
A: Let the distributed vector to be sorted be x. The statement means each processor cannot hold more than 10000 elements of x at any time. You don't have to care about the internals of any function. The point is you cannot have more than 10000 elements of the input data.
Q: "MATLAB*P <-> frontend traffic is allowed, but can only be done in blocks of size 1000". What exactly is "MATLAB*P <-> frontend" traffic? Is that when you use the pp2matlab/matlab2pp function calls? Does this type of traffic occur at any other times? Is the final result supposed to be a ddense object? If so, when does this type of traffic occur?
A: I don't remember if the architecture of MATLAB*P has been explained in lecture. The MATLAB that you are seeing is the 'frontend', and MATLAB*P itself is a parallel server running in the back. All the distributed matrices that you compute on live on the backend. They are never brought to the frontend unless explicitly requested (e.g. pp2matlab, matlab2pp, or subsref like A(1:5)).
The point here is that since there is a limit (10000) on the number of elements you can hold on each processor, you will need to swap data around to do a merge sort. mm mode does not allow message passing from one node to another, so the data has to pass through the frontend. And this is allowed.
Q: Do the blocks really have to be sized EXACTLY at 1000? Or can the blocks be smaller?
A: They can be smaller.
Q: It seems that if you make an assignment like temp = A(1:10,1) where A is a distributed dense array, temp is no longer distributed. (if this is followed by 'whos' temp is listed as a double array). Isn't this passing data to the front end? If not, on which node would temp reside? (I have been assuming that it gets sent to the front end...)
A: Temp is now on the frontend (and this counts as data to frontend traffic). This behavior might change in the future, but not for this homework.
Q: As a note question -- the method I'm using for swapping the data works reliably -- and I am sorting the arrays, but it is fairly slow. I don't know how much I should be worried about this...
A: I wouldn't be too worried. 100,000 elements (<1MB worth of data) is awfully small for a parallel application, so communication cost will dominate.
Q: For the second problem, can we use the frontend MATLAB to do the merging? Is it also limited to holding no more than 10,000 elements of x? Also, does "10000 elements of x" mean any 10000 elements of x, no matter which part of x they come from?
A: Yes, you can use the frontend to merge, and yes, it can hold no more than 10000 elements of x. 10000 elements means any 10000 elements.

Part III: Power method

We will investigate the largest eigenvalue of an infinite matrix A using the power method. The entries of the matrix are defined as follows: the entry at row i and column j is A(i,j) = 1/[(i+j-2)(i+j-1)/2 + i] for 1<=i,j<=n. (Remember that C is "0 based" but matrices are often "1 based".)

Implement the following algorithm in MATLAB*P:

Starting with a random vector x, repeat the following:

y = Ax. (Matrix times vector)
n = sqrt(dot(y,y)) (MATLAB for sqrt of the sum of the squares of each element)
x = y/n (vector over scalar)

Use a matrix of size at least 2^16 square to approximate the infinite matrix. Note that you should NEVER form the matrix explicitly, as it will not fit in the memory of the machine. Instead, use the definition of the entries in your y = Ax computation. Use the 'mm mode' of MATLAB*P for this step.

The second and third step can be done in 'regular' MATLAB*P.

When to stop? n is an approximation to the largest eigenvalue of the matrix and should converge towards the true value. When the difference between n from the previous iteration and the current n is less than tol, stop and return n. Use tol=10^[...]
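The whole iteration, with the matrix never formed, can be sketched in serial Python as below. This is not MATLAB*P and not mm mode; it only shows the shape of the computation on a small truncation. The names entry, matvec and largest_eig are hypothetical, and the size (100) and tolerance (1e-10) used in the run are arbitrary placeholders chosen so the example finishes quickly, not the values required by the assignment. The starting vector here uses uniform random entries, which is also an assumption.

```python
import math
import random

def entry(i, j):
    """A(i, j) for 1-based i and j, straight from the definition."""
    s = i + j
    return 1.0 / ((s - 2) * (s - 1) / 2 + i)

def matvec(x):
    """y = A x, computing each entry of A on the fly instead of storing A."""
    n = len(x)
    return [sum(entry(i, j) * x[j - 1] for j in range(1, n + 1))
            for i in range(1, n + 1)]

def largest_eig(n, tol, max_iter=1000, seed=0):
    """Power method on the n-by-n truncation; returns (estimate, vector)."""
    rng = random.Random(seed)
    x = [rng.random() for _ in range(n)]          # random starting vector
    prev = None
    norm = 0.0
    for _ in range(max_iter):
        y = matvec(x)                             # y = Ax
        norm = math.sqrt(sum(v * v for v in y))   # sqrt(dot(y, y))
        x = [v / norm for v in y]                 # x = y / norm
        if prev is not None and abs(norm - prev) < tol:
            break                                 # successive estimates agree
        prev = norm
    return norm, x

lam, vec = largest_eig(n=100, tol=1e-10)
print(lam)
```

The very first norm is not yet an eigenvalue estimate, because the starting vector is not normalized, so the comparison only begins with the second iteration.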

What to submit

largesteig.m - entry point.
Any associated m-files.

Hint/Clarification:

This guide to mm mode would be very helpful.
x should be a row-distributed column vector.
DON'T even try creating A in its entirety. NOT even as a test or anything. (2^16)^2 * 8 = 2^35 = 32GB. The cluster will run out of memory and very likely hang. As a rule of thumb, before creating a large matrix, do some "back of the envelope" calculation to see if it makes sense (we have 1GB on each node, shared by two processors. So treat it as 512MB per processor).
Note that by the way 'mm mode' works, each process will only have a local piece of the vector x. Think about how this will affect your computation of y = Ax, and add in any necessary step.
Q: Specifically, in problem 3, it seems that to do the y=Ax step in mm mode requires some way of letting each processor know which part of the matrix it has (since we don't actually want to pass parts of matrix A to each processor). How do we do this?
A: x is distributed row-wise (according to the clarification above). So if you think about how matrix-vector multiplication works, you'll see that the part of A that is used on each processor depends only on np, the number of processors. You are right that you don't want to pass A around. Even more than that, you don't want to construct any part of A anywhere at all.
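To make the last answer concrete, here is a serial Python sketch of how a process could work out which rows it owns from its rank and np alone, and then compute its piece of y without ever storing A. The even block split in row_range is an assumption for illustration and may not match how MATLAB*P actually distributes rows; row_range and local_matvec are hypothetical names. Note that every piece needs all of x, which is why a step that makes the full x available is required when each process only holds a local part.

```python
def entry(i, j):
    """A(i, j) for 1-based i and j, from the definition in Part III."""
    s = i + j
    return 1.0 / ((s - 2) * (s - 1) / 2 + i)

def row_range(rank, nprocs, n):
    """1-based half-open row range [lo, hi) owned by `rank` (assumed even split)."""
    base, extra = divmod(n, nprocs)
    lo = 1 + rank * base + min(rank, extra)
    hi = lo + base + (1 if rank < extra else 0)
    return lo, hi

def local_matvec(rank, nprocs, x):
    """Rows lo..hi-1 of y = A x; uses the whole vector x, never a stored A."""
    n = len(x)
    lo, hi = row_range(rank, nprocs, n)
    return [sum(entry(i, j) * x[j - 1] for j in range(1, n + 1))
            for i in range(lo, hi)]

x = [1.0] * 10
pieces = [local_matvec(rank, 3, x) for rank in range(3)]
print([len(p) for p in pieces])  # [4, 3, 3]
```

Concatenating the pieces in rank order gives the same y as a serial matrix-vector product.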

Part IV: How fast are those clusters

This question tries to bring your attention to the issues affecting the performance of a cluster.

Refer to the Supercomputing @ MIT page, which lists some clusters at MIT that we have information on. Within that page, pick 5 clusters. Your 5 choices must include all 3 different types of networks (Fast Ethernet, Gigabit Ethernet, Myrinet). Then try to estimate the High Performance Linpack benchmark (the benchmark used by top500) result on those clusters.

This is an open ended question - we don't know the answers either. So any answer that is reasonable and based on criteria relevant to cluster performance will be accepted.

How do you get started on this? Suggestions:

Google - search for clusters with HPL results. Compare.
Log into a cluster on which you have an account. Run HPL for 1 node, 2 nodes ... interpolate.
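One simple starting point for an estimate is to compute the theoretical peak of a cluster from its hardware description and scale it by an efficiency factor. The sketch below is only a first-order model under stated assumptions: the function names are hypothetical, the efficiency is something you would have to justify yourself (for example from published HPL results on clusters with a similar network), and the numbers in the example run are made up and do not describe any MIT cluster.

```python
def rpeak_flops(nodes, cpus_per_node, clock_hz, flops_per_cycle):
    """Theoretical peak in floating-point operations per second."""
    return nodes * cpus_per_node * clock_hz * flops_per_cycle

def hpl_estimate(rpeak, efficiency):
    """Scale the peak by an assumed HPL efficiency in (0, 1]."""
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    return rpeak * efficiency

# Made-up example: 16 dual-CPU nodes at 1 GHz, 1 flop per cycle, 50% efficiency.
peak = rpeak_flops(16, 2, 1.0e9, 1)
print(peak / 1e9, "Gflop/s peak")
print(hpl_estimate(peak, 0.5) / 1e9, "Gflop/s estimated")
```

The interesting part of the question is the efficiency factor, since that is where the network type shows up.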

What to submit

Your text/ps/pdf file

Ron Choy

Last modified: Sun Feb 9 22:32:13 GMT 2003