6.896 Project
Presentations
Improving Cilk
Adaptively Parallel
Processor Allocation for Cilk Jobs
|
|
|
Kunal Agrawal |
|
Siddhartha Sen |
Goal
|
|
|
Design and implement a dynamic
processor-allocation system for adaptively parallel jobs (jobs for which the
number of processors that can be used without waste varies during execution) |
|
The problem of allocating processors to
adaptively parallel jobs is called the adaptively parallel
processor-allocation problem [2]. |
|
At any given time, each job j has a
desire dj and an allotment mj |
Goal (cont.)
|
|
|
|
We want to design a
processor-allocation system that achieves a “fair” and “efficient” allocation
among all jobs. |
|
“fair” means that whenever a job
receives fewer processors than it desires, no other job receives more than
one more processor than this job received [2] |
|
“efficient” means that no job receives
more processors than it desires [2] |
Illustration of Problem
Main Algorithmic
Questions
|
|
|
|
Dynamically determine the desires of
each job in the system |
|
Heuristics on steal rate |
|
Heuristics on number of visible stack
frames |
|
Heuristics on number of threads in
ready deques |
|
Dynamically determine the allotment for
each job such that the resulting allocation is fair and efficient |
|
SRLBA algorithm [2] |
|
Cilk macroscheduler algorithm [3] |
|
Lottery scheduling techniques [5] |
Assumptions
|
|
|
All jobs on the system are Cilk jobs |
|
Jobs can enter and leave the system and
change their parallelism during execution |
|
All jobs are mutually trusting (they
will stay within the bound of their allotments and communicate their desires
honestly) |
|
Each job has at least one processor to
start with |
References
|
|
|
Supercomputing Technologies Group. Cilk
5.3.2 Reference Manual. MIT Lab for Computer Science, November 2001. |
|
B. Song. Scheduling adaptively parallel
jobs. Master's thesis, Massachusetts Institute of Technology, January 1998. |
|
R. D. Blumofe, C. E. Leiserson, and B.
Song. Automatic processor allocation for work-stealing jobs. |
|
C. A. Waldspurger. Lottery and Stride
Scheduling: Flexible Proportional-Share Resource Management. PhD thesis,
Massachusetts Institute of Technology, September 1995. |
|
C. A. Waldspurger and W. E. Weihl.
Lottery scheduling: Flexible proportional-share resource management. In Proceedings
of the First Symposium on Operating System Design and Implementation, pages
1-11. USENIX, November 1994. |
|
R. D. Blumofe, C. F. Joerg, B. C.
Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: An e±cient
multithreaded runtime system. In Proceedings of the Fifth ACM SIGPLAN
Symposium on Principles and Practice of Parallel Programming (PPoPP), pages
207-216, Santa Barbara, California, July 1995. |
|
|
Fast Serial-Append for Cilk
|
|
|
|
6.895 – Theory of Parallel Systems |
|
|
|
Project Presentation (Proposal) |
|
|
|
Alexandru Caracas |
Serial-Append (cont'd)
|
|
|
|
A Cilk dag |
|
Numbers represent the serial execution
of threads |
|
For example: |
|
Consider all threads perform file I/O
operations |
Serial-Append (cont'd)
|
|
|
|
Same Cilk dag |
|
Numbers represent the parallel
execution of threads |
|
Goal: |
|
We would like to execute the threads in
parallel while still allowing for the serial order to be reconstructed. |
Serial-Append (cont'd)
|
|
|
Partitioning of the Cilk dag |
|
Colors represent the I/O operations
done by a processor between two steal operations. |
Serial-Append
|
|
|
|
The Serial-Append of the computation is
obtained by reading the output in the order: |
|
orange (I) |
|
green (II) |
|
dark-red (III) |
|
purple (IV) |
|
red (V) |
Project Goals
|
|
|
|
Efficient serial-append algorithm |
|
possibly using B+-trees |
|
Analyze possible implementation: |
|
kernel level |
|
formatted file |
|
Cilk file system |
|
Implementation |
|
(one of the above) |
|
Port Cheerio to Cilk 5.4. |
|
Comparison with own implementation |
|
Port
serial applications to use serial-append (analyze their performance) |
References
|
|
|
Robert D. Blumofe and Charles E.
Leiserson. Scheduling multithreaded computations by work stealing. In
Proceedings of the 35th Annual Symposium on Foundations of Computer Science,
pages 356-368, Santa Fe, New Mexico, November 1994. |
|
Matthew S. DeBergalis. A parallel file
I/O API for Cilk. Master's thesis, Department of Electrical Engineering and
Computer Science, Massachusetts Institute of Technology, June 2000. |
|
Supercomputing Technology Group MIT
Laboratory for Computer Science. Cilk 5.3.2 Reference Manual, November 2001.
Available at http://supertech.lcs.mit.edu/cilk/manual-5.3.2.pdf. |
A Space-Efficient Global
Scheduler for Cilk
|
|
|
Jason Hickey |
|
Tyeler Quentmeyer |
Idea
|
|
|
Implement a new scheduler for Cilk
based on Narlikar and Blelloch’s Space-Efficient Scheduling of Nested
Parallelism paper |
AsyncDF Scheduler
|
|
|
Idea: minimize peak memory usage by
preempting threads that try to allocate large blocks of memory |
|
|
|
Algorithm: Roughly, AsyncDF runs the P
left-most threads in the computation DAG |
Example
Comparison to Cilk
|
|
|
|
Cilk |
|
Distributed |
|
Run threads like a serial program until
a processor runs out of work and has to steal |
|
Space:
p*S1 |
|
AsyncDF |
|
Global |
|
Preempt a thread when it tries to
allocate a large block of memory and replace it with a computationally heavy
thread |
|
Space: S1 + O(K*D*p) |
AsyncDF Optmizations
|
|
|
|
Two versions of scheduler |
|
Serial |
|
Parallel |
|
Running P left-most threads does not
exploit the locality of a serial program |
|
Solution: Look at the cP left-most
threads and “group” threads together for locality |
Project Plan
|
|
|
|
|
Implement serial version of AsyncDF |
|
Experiment with algorithms to group
threads for locality |
|
Performance comparison to Cilk |
|
Different values of AsyncDF’s tunable
parameter |
|
Case studies |
|
Recursive matrix multiplication |
|
Strassen multiplication |
|
N-body problem |
|
Implement parallel version of AsyncDF |
|
Performance comparison to Cilk |
|
Additional experiments/modifications as
research suggests |
Questions?
Automatic conversion of
NSP DAGs to SP DAGs
"Objectives :"
|
|
|
|
Objectives : |
|
Design of an efficient Algorithm to
convert an arbitrary NSP DAG to SP DAG |
|
- Correctness preserving |
|
Identifying different graph topologies |
|
|
|
Analysis of Space and Time complexities |
|
Bound on increase in critical path
length |
|
|
|
Perform empirical analysis using cilk |
|
|
Transactional Cilk
Language-Level Complex
Transactions
C. Scott Ananian
An Evaluation of Nested
Transactions
|
|
|
Sean Lie |
|
sean_lie@mit.edu |
|
|
|
6.895 Theory of Parallel Systems |
|
Wednesday October 15th, 2003 |
Transactional Memory
|
|
|
The programmer is given the ability to
define atomic regions of arbitrary size. |
|
|
|
Start_Transaction; |
|
Flow[i] = Flow[i] – X; |
|
Flow[j] = Flow[j] + X; |
|
End_Transaction; |
|
|
|
With hardware support, different
transactions can be executed concurrently when there are no memory conflicts. |
Nested Transactions
|
|
|
Nested transactions can be handled by
simply merging all inner transactions into the outermost transaction. |
Concurrency
|
|
|
Merging transactions is simple but may
decrease the program concurrency. |
|
|
Nested Concurrent
Transactions
|
|
|
|
Nested Concurrent Transactions (NCT): |
|
Allow the inner transaction to run and
complete independently from the outer transaction. |
|
Questions: |
|
When is NCT actually useful? |
|
How much overhead is incurred by NCT? |
|
How much do we gain from using NCT? |
|
The tool needed to get the answers: |
|
Transactional memory hardware simulator
– UVSIM |
|
UVSIM currently supports merging nested
transactions. |
|
How to get the answers: |
|
Identify applications where NCT is
useful. Look at malloc? |
|
Implement necessary infrastructure to
run NCTs in UVSIM. |
|
Evaluate performance of NCT vs. merging
transactions. |
Compiler Support for
Atomic Transactions in Cilk
|
|
|
Jim Sukha |
|
Tushara Karunaratna |
Determinacy Race Example
New Keywords in Cilk
Simulation of
Transactional Memory
Work Involved
|
|
|
Modify the Cilk parser to accept xbegin
and xend as keywords. |
|
Identify all load and store operations
in user code, and replace them with calls to functions from the runtime
system. |
|
Implement the runtime system. An initial implementation is to divide
memory into blocks and to use a hash table to store a lock and backup values
for each block. |
|
Experiment with different runtime
implementations. |
Data Race in
Transactional Cilk
Background
|
|
|
|
Current Cilk: achieve atomicity using
lock |
|
problems such as priority inversion,
deadlock, etc. |
|
Nontrivial to code |
|
Transactional memory |
|
Software TM: overhead -> slow |
|
Hardware TM |
|
|
Cilk’s transaction
everywhere
|
|
|
|
Every instruction becomes part of the a
transaction |
|
Based on HTM |
|
Cilk transaction: |
|
cut Cilk program into atomic pieces |
|
Base on some language construct or
compiler automatic generate |
Data Race
|
|
|
|
Very different from traditional
definition in Nondeterminator |
|
Assumption (correct parallelization) |
|
Definition (1st half of
Kai’s master thesis) |
|
Efficient detection (v.s. NP-complete
proved by Kai) |
|
Make use of current algorithms/tools |
|
|
Non-Determinacy Detection
Linear-time determinacy
race detection
Motivation
|
|
|
|
Nondeterminator has two parts: |
|
Check whether threads are logically
parallel |
|
Use shadow spaces to detect determinacy
race |
|
SP-Bags algorithm uses LCA lookup to
determine whether two threads are parallel |
|
LCA lookup with disjoint-sets data
structure takes O(α(v,v)) time. |
|
We do not care about the LCA. We just
want to know if two threads are logically parallel. |
New algorithm for
determinacy-race detection
|
|
|
|
Similar to SP-Bags algorithm with two
parts: |
|
Check whether threads are logically
parallel |
|
Use shadow spaces to detect determinacy
race |
|
Use order maintenance data-structure to
determine whether threads are parallel |
|
Gain: Order maintenance operations are O(1)
amortized time. |
The algorithm
|
|
|
|
|
Maintain two orders: |
|
Regular serial, depth-first, left-to
right execution. |
|
At each spawn, follow spawn thread
before continuation thread |
|
Right-to-left execution. |
|
At each spawn, follow continuation
thread before spawn thread |
|
Claim: Two threads e1 and e2
are parallel if and only if e1 precedes e2 in one
order, and e2 precedes e1 in the other. |
Depth-first,
left-to-right execution
Right-to-left execution
How to maintain both
orders in serial execution
|
|
|
|
Keep two pieces of state for each
procedure F: |
|
CF is current thread in F |
|
SF is next sync statement in
F. |
|
In both orders, insert new threads
after current thread |
|
On spawn, insert continuation thread
before spawn thread in one order. Do the opposite in the other. |
|
On sync, advance CF to SF |
|
In any other thread, update shadow
spaces based on current thread |
Example
Project Proposal:
Parallel Nondeterminator
Objective
|
|
|
I propose Parallel Nondeterminator to: |
|
Check the determinacy race in the
parallel execution of the program written in the language like Cilk |
|
Develop efficient algorithm to decide
the concurrency between threads |
|
Develop efficient algorithm to reduce
the number of entries in access history |
Primitive Idea in
Concurrency Test
|
|
|
(1) Labeling Scheme |
|
(2) Set operation |
|
Thread representation (fid, tid) |
|
|
"Two sets:"
|
|
|
Two sets: |
|
Parallel set PS(f) ={(pf, ptid) | all threads with tid >= ptid in function
pf is parallel with the running thread of f} |
|
Children set CS(f)={fid | fid is the
descendant of current function } |
|
|
|
Operations: |
|
Spawn: Thread Tx of function Fi spawn
function Fj |
|
Operations on child Fj |
|
|
|
|
|
|
|
Operations on parent Fi |
|
|
|
|
"Sync:"
|
|
|
Sync: Function Fi executes sync |
|
PS(Fi)=PS(Fi)-CS(Fi) |
|
|
|
Return: Function Fj returns to Function
Fi |
|
CS(Fi) = CS(Fi) + CS(Fj) |
|
Release PS(Fj) and CS(Fj) |
|
|
|
Concurrency Test: |
|
Check if (fx, tx) is parallel with the
current running thread (fc, tc): |
|
|
Primitive Idea for Access
History
|
|
|
Serial |
|
read(l), write(l) |
|
Simplest parallel program without
nested parallelism |
|
Two parallel read records, one write
record |
|
Language structure like Cilk |
|
read: max level of parallelism, one
write record |
|
|
|
Q: Is it possible to keep only two read
records for each shared location in Cilk Parallel Nondeterminator? |
|
Keep two parallel read records with
highest level of LCA in parent child spawn tree. |
"Thank you very
much!"
Parallel
Nondeterminator
Wang Junqing
Using Cilk
Accelerating
Multiprocessor Simulation
|
|
|
6.895 Project Proposal |
|
|
|
Kenneth C. Barr |
Introduction
|
|
|
|
|
Inspiration |
|
UVSIM is slow |
|
FlexSIM is cool |
|
Interleave short periods of detailed
simulation with long periods of functional warmup (eg., cache and predictor
updates, but not out-of-order logic) |
|
Achieve high accuracy in fraction of
the time |
|
Multi-configuration simulation is cool |
|
Recognize structural properties. E.G., “contents of FA cache are subset of
all larger FA caches with same line size” so search small->large. Once we hit is small cache, we can stop
searching |
|
Simulate many configurations with a
single run |
The Meta Directory
|
|
|
|
|
Combine previous ideas to speed
multiprocessor simulation |
|
Sequential Consistency is what matters,
perhaps detailed consistency mechanisms can be interleaved with a fast,
functional equivalent method |
|
|
|
|
|
|
Meta-directory example
|
|
|
|
Detailed simulation: |
|
P1 ld x ;null -> Sh, data sent to
P1 |
|
P2 ld x ;Sh -> Sh, data sent to P2 |
|
P3 ld x ;Sh- > Sh data sent to P3 |
|
P2 st x ;Sh -> Ex, inv sent to P1 and P3, data sent to P2 |
|
P1 st x ;Ex -> Ex, inv sent to P2,
data sent to P1 |
|
P2 ld x ;Ex -> Sh, data sent to P2 |
|
Shortcut |
|
Record access in a small
“meta-directory:”
x: {P1, 0, r}, {P2, 1, r}, {P3, 2, r}, {P2, 3, w}, {P1, 4, w}, {P2, 5, r} |
|
All reads and writes occur in a memory;
no messages sent or received, no directory modeled, no cache model in
processor (?) |
|
When it comes time for detailed
simulation, we can reconstruct directory by scanning backwards: x is shared
by P1 and P2. |
|
|
Challenges
|
|
|
Accesses stored in circular queue. How many records needed for each address? |
|
What happens when processor writes back
data, current scheme introduces false hits. |
|
Does this scheme always work? Some proofs would be nice. |
Methodology
|
|
|
|
Platform |
|
I got Simics to boot SMP Linux |
|
But Bochs is open-source |
|
X86 is key if this is to be long-term
framework |
|
Cilk benchmarks? |
|
Tasks |
|
Create detailed directory model |
|
Create meta directory |
|
Create a way to switch back and forth
maintaining program correctness |
|
Measure the dramatic improvement! |
Conclusion
|
|
|
|
Reality |
|
Sounds daunting for December deadline,
but if I can prove feasibility or fatal flaws, I’d be excited |
|
Suggestions? |
Project Proposal:
Parallelizing METIS
What is METIS?
|
|
|
Graph Partitioning algorithm |
|
|
|
Developed at University of Minnesota by
George Karypis and Vipin Kumar |
|
|
|
Information on METIS: |
|
http://www-users.cs.umn.edu/~karypis/metis/ |
Stages of Algorithm:
Coarsening
|
|
|
|
Task: Create sequentially smaller
graphs that make good representation of original graph by collapsing
connected nodes. |
|
|
|
Issues: |
|
Minor concurrency issues. |
|
Maintaining data locality. |
|
Writing large amount of data to memory
in a scalable fashion. |
Stages of Algorithm:
Initial partitioning.
|
|
|
Task: partition small graph |
|
|
|
Issues: none. Runtime of this portion
of algorithm very small. |
Stages of Algorithm:
Uncoarsening and Refinement
|
|
|
|
Uncoarsening and Refinement: project
partition of coarsest graph to original graph and refine partition. |
|
|
|
Issues: |
|
Major concurrency issues. |
|
Remaining issues under research. |
Parallel
Sorting
Paul Youn
|
|
|
I propose to parallelize a sorting
algorithm. Initially, looking at Quick Sort, but may investigate other
sorting algorithms. |
|
Sorting is a problem that can
potentially be sped up significantly (mergesort). |
|
As an alternative, considering other
algorithms. |
Goals
|
|
|
Similar to Lab 1 approach. |
|
Speedy serial algorithm. |
|
Basic parallel algorithm. |
|
Runtime Analysis. |
|
Empirical verification of runtime
analysis. |
|
Parallel speedups. |
HELP!
|
|
|
Not sure about the scope of the
problem. |
|
Anyone out there interested in this
stuff? |
|
Other appropriate algorithms? Max-flow? |
PROJECT
PROPOSAL
SMA5509
Implement FIR filter in parallel computing using Cilk
|
|
|
Name: Pham Duc Minh |
|
Matric: HT030502H |
|
Email: g0300231@nus.edu.sg |
|
phamducm@comp.nus.edu.sg |
Discrete Signal
|
|
|
|
|
|
|
the discrete input signal is x(n) |
|
the output signal is y(n) |
|
the impulse response h(n) |
|
|
|
|
FIR filter
|
|
|
|
|
|
|
Discrete time LTI (Linearity and Time
Invariance) systems can be classified into FIR and IIR. |
|
An FIR filter has impulse response h(n)
that extends only over a finite time interval, say 0≤ n ≤ M, and
is identically zero beyond that: |
|
{h0, h1, h2 . . . , hM, 0, 0, 0 . . .}
M is referred to as the filter order. |
|
The output y(n) in FIR is simplified to
the finite-sum form: |
|
|
|
|
|
|
|
or, explicitly |
|
|
Methods to process
|
|
|
|
1. Block processing |
|
a. Convolution |
|
|
|
|
|
|
|
|
|
|
b. Direct Form
The length of the input signal x(n) is L
The length of output y(n) will be Ly = Lx+Lh-1
c. Matrix
form
with the dimension of H is
"Overlap-Add Block
Convolution Method"
|
|
|
Overlap-Add Block Convolution Method |
|
above methods are not feasible in
infinitive or extremely long applications |
|
divide the long input into contiguous
non-overlapping blocks of manageable length, say L samples |
|
y0 =
h*x0 |
|
y1 =
h*x1 |
|
y2 =
h*x2 |
|
Thus the resulting block y0 starts at
absolute time n = 0 |
|
block y1 starts at n = L, and so on. |
|
The last M samples of each output block
will overlap with |
|
the first M outputs of the next block . |
"2."
|
|
|
2. Sampling Processing |
|
The direct form I/O convolutional
equation |
|
|
|
We define the internal states w1(n),
w2(2), w3(n)…. |
|
w0(n) = x(n) |
|
w1(n) = x(n-1) = w0(n-1) |
|
w2(n) = x(n-2) = w1(n-1) |
|
….. |
|
With this definition, we can y(n) in
the form |
"Goal of project"
|
|
|
Goal of project |
|
Use Cilk to implement FIR filter in
parallel |
|
Compare with the processes of DSP kit
(pipeline) and the process of FPGA (Field Programmable Gate Array) (also
parallel). |
|
|
"How to implement"
|
|
|
How to implement |
|
Multithreaded programming |
|
Cilk compiler |
Other things
|
|
|
If I finish soon, I hope additional
work suggestion from staff. |
Cache-Oblivious
Algorithms
SMA5509: Theory of
Parallel Systems
|
|
|
Sriram Saroop |
|
Advait D. Karande |
bzip2 compression
|
|
|
Lossless text compression scheme |
|
|
Burrows-Wheeler Transform
(1994)
|
|
|
A pre-processing step for data
compression |
|
Involves sorting of all rotations of
the block of data to be compressed |
|
Rationale: Compressed string transforms
better |
|
Worst case complexity of N2log(N),
where N = size of block to be sorted |
|
Can be improved to N(logN)2 (Sadakane
’98) |
Why make it
cache-oblivious?
|
|
|
Performance of BWT heavily dependent on
cache behavior (Seward ’00) |
|
Avoid slowdown for large files with
high degree of repetitiveness |
|
Especially useful in applications like
bzip2 that are required to perform well on different memory hierarchies |
|
|
Project Plan
|
|
|
Design and implementation of a cache
oblivious algorithm for BWT |
|
Performance comparisons with standard
bzip2 for available benchmarks |
|
Back-up: Optimizations for BWT
including parallelization |
|
|
|
|
New Cache-Oblivious
Algorithms
Zhang Jiahui and Neel Kamal
Cache-Oblivious Lock Free
Algorithms
Cache-Oblivious
Algorithms
|
|
|
|
Optimal memory access |
|
Matrix multiply, Fast Fourier Transform |
|
B-trees, Priority Queues |
|
|
|
Concurrent? |
|
|
|
Lock-free? |
|
At least one thread can always make
progress. |
|
|
Packed Memory Structure
|
|
|
|
Operations: |
|
Traverse(k) --
O(k/B) |
|
Insert
-- O(log2n/B) |
|
Delete
-- O(log2n/B) |
|
Goal: |
|
Design a lock-free version of the
cache-oblivious algorithm |
|
|
B-tree
|
|
|
|
Two algorithms |
|
Based on packed memory structure |
|
Based on alternative recursive
construction and pointer-swinging |
|
|
|
|