1
6.895 Project Presentations
  • October 15–17, 2003
2
Improving Cilk
3
Adaptively Parallel Processor Allocation for Cilk Jobs
  • Kunal Agrawal
  • Siddhartha Sen
4
Goal
  • Design and implement a dynamic processor-allocation system for adaptively parallel jobs (jobs for which the number of processors that can be used without waste varies during execution)
  • The problem of allocating processors to adaptively parallel jobs is called the adaptively parallel processor-allocation problem [2].
  • At any given time, each job j has a desire dj and an allotment mj
5
Goal (cont.)
  • We want to design a processor-allocation system that achieves a “fair” and “efficient” allocation among all jobs.
    • “fair” means that whenever a job receives fewer processors than it desires, no other job receives more than one more processor than this job received [2]
    • “efficient” means that no job receives more processors than it desires [2]
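An allocation that is fair and efficient in this sense can be computed by max-min "water-filling" over the reported desires. The sketch below is illustrative only (Python, not the SRLBA algorithm of [2]): it hands out equal shares in rounds, capping each job at its desire.

```python
def allocate(desires, P):
    """Max-min allocation of P processors given a dict of job desires.
    Efficient: no job gets more than its desire.
    Fair: a deprived job is within one processor of every other allotment."""
    alloc = {j: 0 for j in desires}
    remaining = P
    unsat = set(desires)                      # jobs whose desire is unmet
    while remaining > 0 and unsat:
        share = remaining // len(unsat)       # equal share this round
        if share == 0:
            # fewer processors left than unsatisfied jobs: one each until gone
            for j in sorted(unsat):
                if remaining == 0:
                    break
                alloc[j] += 1
                remaining -= 1
            break
        for j in list(unsat):
            give = min(share, desires[j] - alloc[j])
            alloc[j] += give
            remaining -= give
            if alloc[j] == desires[j]:
                unsat.remove(j)
    return alloc
```

For example, with desires {10, 2, 3} and P = 9, the small jobs are satisfied and the leftover goes to the large one.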
6
Illustration of Problem
7
Main Algorithmic Questions
  • Dynamically determine the desires of each job in the system
    • Heuristics on steal rate
    • Heuristics on number of visible stack frames
    • Heuristics on number of threads in ready deques
  • Dynamically determine the allotment for each job such that the resulting allocation is fair and efficient
    • SRLBA algorithm [2]
    • Cilk macroscheduler algorithm [3]
    • Lottery scheduling techniques [5]
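Lottery scheduling [5] grants each resource to the holder of a randomly drawn ticket, with tickets held in proportion to desire. A minimal sketch of the draw itself (Python, illustrative; the client names are hypothetical):

```python
def draw(tickets, r):
    """Return the client that wins lottery number r.
    tickets: list of (client, ticket_count); 0 <= r < total tickets."""
    for client, n in tickets:
        if r < n:
            return client
        r -= n                      # skip past this client's ticket range
    raise ValueError("r out of range")
```

In practice r comes from a pseudorandom generator over [0, total tickets), so each client wins with probability proportional to its ticket count.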
8
Assumptions
  • All jobs on the system are Cilk jobs
  • Jobs can enter and leave the system and change their parallelism during execution
  • All jobs are mutually trusting (they stay within the bounds of their allotments and communicate their desires honestly)
  • Each job has at least one processor to start with
9
References
  • [1] Supercomputing Technologies Group. Cilk 5.3.2 Reference Manual. MIT Lab for Computer Science, November 2001.
  • [2] B. Song. Scheduling adaptively parallel jobs. Master's thesis, Massachusetts Institute of Technology, January 1998.
  • [3] R. D. Blumofe, C. E. Leiserson, and B. Song. Automatic processor allocation for work-stealing jobs.
  • [4] C. A. Waldspurger. Lottery and Stride Scheduling: Flexible Proportional-Share Resource Management. PhD thesis, Massachusetts Institute of Technology, September 1995.
  • [5] C. A. Waldspurger and W. E. Weihl. Lottery scheduling: Flexible proportional-share resource management. In Proceedings of the First Symposium on Operating System Design and Implementation, pages 1-11. USENIX, November 1994.
  • [6] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: An efficient multithreaded runtime system. In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), pages 207-216, Santa Barbara, California, July 1995.


10
Fast Serial-Append for Cilk
    • 6.895 – Theory of Parallel Systems


    • Project Presentation (Proposal)


    • Alexandru Caracas
11
Serial-Append (cont'd)
  • A Cilk dag
  • Numbers represent the serial execution order of the threads
  • For example:
    • Consider all threads perform file I/O operations
12
Serial-Append (cont'd)
  • Same Cilk dag
  • Numbers represent the parallel execution order of the threads
  • Goal:
    • We would like to execute the threads in parallel while still allowing for the serial order to be reconstructed.
13
Serial-Append (cont'd)
  • Partitioning of the Cilk dag
  • Colors represent the I/O operations done by a processor between two steal operations.
14
Serial-Append
  • The Serial-Append of the computation is obtained by reading the output in the order:
    • orange (I)
    • green (II)
    • dark-red (III)
    • purple (IV)
    • red (V)
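One way to realize this reconstruction, sketched under the assumption that the runtime can tag each between-steals segment with its rank in the serial execution order (Python, illustrative only — the project itself targets Cilk's I/O layer):

```python
def reconstruct(segments):
    """Rebuild the serial-append output from per-steal segments.
    segments: list of (serial_rank, data) pairs, produced in arbitrary
    (parallel) order by the workers; ranks are assumed distinct."""
    return b"".join(data for _, data in sorted(segments))
```

Reading the segments in rank order is exactly the orange/green/dark-red/purple/red traversal above.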
15
Project Goals
  • Efficient serial-append algorithm
    • possibly using B+-trees
  • Analyze possible implementation:
    • kernel level
    • formatted file
    • Cilk file system
  • Implementation
    • (one of the above)
  • Port Cheerio to Cilk 5.4.
  • Comparison with own implementation
  • Port serial applications to use serial-append (analyze their performance)
16
References
  • Robert D. Blumofe and Charles E. Leiserson. Scheduling multithreaded computations by work stealing. In Proceedings of the 35th Annual Symposium on Foundations of Computer Science, pages 356-368, Santa Fe, New Mexico, November 1994.
  • Matthew S. DeBergalis. A parallel file I/O API for Cilk. Master's thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, June 2000.
  • Supercomputing Technologies Group, MIT Laboratory for Computer Science. Cilk 5.3.2 Reference Manual, November 2001. Available at http://supertech.lcs.mit.edu/cilk/manual-5.3.2.pdf.
17
A Space-Efficient Global Scheduler for Cilk
  • Jason Hickey
  • Tyeler Quentmeyer
18
Idea
  • Implement a new scheduler for Cilk based on Narlikar and Blelloch’s Space-Efficient Scheduling of Nested Parallelism paper
19
AsyncDF Scheduler
  • Idea: minimize peak memory usage by preempting threads that try to allocate large blocks of memory


  • Algorithm: Roughly, AsyncDF runs the P left-most threads in the computation DAG
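The "P left-most" rule can be viewed as a priority queue keyed by each ready thread's rank in the depth-first serial order. A minimal sketch (Python, illustrative; the ranks and thread names are hypothetical):

```python
import heapq

def leftmost(ready, P):
    """Select the P left-most ready threads.
    ready: list of (serial_rank, thread) pairs; lower rank = further left
    in the computation DAG's depth-first (serial) order."""
    return heapq.nsmallest(P, ready)
```

The real AsyncDF scheduler additionally preempts threads at large allocations, which is what yields the S1 + O(K*D*p) space bound.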
20
Example
21
Comparison to Cilk
  • Cilk
    • Distributed
    • Run threads like a serial program until a processor runs out of work and has to steal
    • Space:  p*S1
  • AsyncDF
    • Global
    • Preempt a thread when it tries to allocate a large block of memory and replace it with a computationally heavy thread
    • Space: S1 + O(K*D*p)
22
AsyncDF Optimizations
  • Two versions of scheduler
    • Serial
    • Parallel
  • Running P left-most threads does not exploit the locality of a serial program
    • Solution: Look at the cP left-most threads and “group” threads together for locality
23
Project Plan
  • Implement serial version of AsyncDF
    • Experiment with algorithms to group threads for locality
    • Performance comparison to Cilk
    • Different values of AsyncDF’s tunable parameter
    • Case studies
      • Recursive matrix multiplication
      • Strassen multiplication
      • N-body problem
  • Implement parallel version of AsyncDF
    • Performance comparison to Cilk
  • Additional experiments/modifications as research suggests
24
Questions?
25
Automatic conversion of NSP DAGs to SP DAGs
  • Sajindra Jayasena
26
"Objectives :"
  • Objectives :
  • Design of an efficient Algorithm to convert an arbitrary NSP DAG to SP DAG
    • - Correctness preserving
    • Identifying different graph topologies


  • Analysis of Space and Time complexities
    • Bound on increase in critical path length


  • Perform empirical analysis using Cilk


27
Transactional Cilk
28
Language-Level Complex Transactions
  • C. Scott Ananian
29
An Evaluation of Nested Transactions
  • Sean Lie
  • sean_lie@mit.edu


  • 6.895 Theory of Parallel Systems
  • Wednesday October 15th, 2003
30
Transactional Memory
  • The programmer is given the ability to define atomic regions of arbitrary size.


  • Start_Transaction;
  • Flow[i] = Flow[i] – X;
  • Flow[j] = Flow[j] + X;
  • End_Transaction;


  • With hardware support, different transactions can be executed concurrently when there are no memory conflicts.
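The "no memory conflicts" condition can be stated over read and write sets: two transactions may commit concurrently iff neither writes a location the other reads or writes. A sketch (Python, illustrative; the location names mirror the example above):

```python
def conflict(t1, t2):
    """Each transaction is a (reads, writes) pair of location sets.
    Returns True iff the two transactions cannot run concurrently."""
    r1, w1 = t1
    r2, w2 = t2
    # a write by one side colliding with any access by the other is a conflict
    return bool(w1 & (r2 | w2)) or bool(w2 & (r1 | w1))
```

Two transfers on disjoint Flow entries commute; two touching the same entry do not.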
31
Nested Transactions
  • Nested transactions can be handled by simply merging all inner transactions into the outermost transaction.
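Merging ("flattening") can be sketched as a nesting counter: inner begin/end pairs only adjust the depth, and a real commit happens only when the outermost transaction ends (Python, illustrative only — the actual mechanism lives in hardware/simulator state):

```python
class Flattened:
    """Flatten nested transactions into the outermost one."""
    def __init__(self):
        self.depth = 0      # current transaction nesting depth
        self.commits = 0    # number of outermost-level commits

    def start(self):
        self.depth += 1     # inner starts just deepen the nesting

    def end(self):
        self.depth -= 1
        if self.depth == 0:
            self.commits += 1   # only the outermost end commits
```

This is simple, but as the next slide notes, it serializes work that could otherwise proceed concurrently.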
32
Concurrency
  • Merging transactions is simple but may decrease the program concurrency.


33
Nested Concurrent Transactions
  • Nested Concurrent Transactions (NCT):
    • Allow the inner transaction to run and complete independently from the outer transaction.
  • Questions:
    • When is NCT actually useful?
    • How much overhead is incurred by NCT?
    • How much do we gain from using NCT?
  • The tool needed to get the answers:
    • Transactional memory hardware simulator – UVSIM
    • UVSIM currently supports merging nested transactions.
  • How to get the answers:
    • Identify applications where NCT is useful. Look at malloc?
    • Implement necessary infrastructure to run NCTs in UVSIM.
    • Evaluate performance of NCT vs. merging transactions.
34
Compiler Support for Atomic Transactions in Cilk
  • Jim Sukha
  • Tushara Karunaratna
35
Determinacy Race Example
36
New Keywords in Cilk
37
Simulation of Transactional Memory
38
Work Involved
  • Modify the Cilk parser to accept xbegin and xend as keywords.
  • Identify all load and store operations in user code, and replace them with calls to functions from the runtime system.
  • Implement the runtime system.  An initial implementation is to divide memory into blocks and to use a hash table to store a lock and backup values for each block.
  • Experiment with different runtime implementations.
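The proposed block-based runtime can be sketched as follows (Python, illustrative; the block size is an assumed parameter, and lock acquisition is elided in this single-threaded sketch — the real hash table would hold a lock per block):

```python
BLOCK = 64  # assumed bytes per block

class TxRuntime:
    """Divide memory into blocks; keep per-block backup values so an
    aborted transaction can roll back its stores."""
    def __init__(self, mem):
        self.mem = dict(mem)    # address -> value
        self.table = {}         # block id -> {address: backup value}

    def store(self, addr, value):
        blk = addr // BLOCK
        # save the old value once per address, before the first overwrite
        self.table.setdefault(blk, {}).setdefault(addr, self.mem.get(addr))
        self.mem[addr] = value

    def commit(self):
        self.table.clear()      # backups no longer needed

    def abort(self):
        for backups in self.table.values():
            for addr, old in backups.items():
                self.mem[addr] = old
        self.table.clear()
```

Instrumented loads and stores in user code would be rewritten into calls like `store` above.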
39
Data Race in Transactional Cilk
  • Xie Yong
40
Background
  • Current Cilk: achieves atomicity using locks
    • problems such as priority inversion, deadlock, etc.
    • nontrivial to code
  • Transactional memory
    • Software TM: overhead -> slow
    • Hardware TM

41
Cilk’s Transactions Everywhere
  • Every instruction becomes part of a transaction
  • Based on HTM
  • Cilk transactions:
    • cut the Cilk program into atomic pieces
    • based on some language construct, or generated automatically by the compiler
42
Data Race
  • Very different from the traditional definition in the Nondeterminator
  • Assumption (correct parallelization)
  • Definition (1st half of Kai’s master’s thesis)
    • Efficient detection (vs. the NP-completeness proved by Kai)
    • Make use of current algorithms/tools

43
Non-Determinacy Detection
44
Linear-time determinacy race detection
  • Jeremy Fineman
45
Motivation
  • Nondeterminator has two parts:
    • Check whether threads are logically parallel
    • Use shadow spaces to detect determinacy race
  • SP-Bags algorithm uses LCA lookup to determine whether two threads are parallel
  • LCA lookup with disjoint-sets data structure takes O(α(v,v)) time.
  • We do not care about the LCA. We just want to know if two threads are logically parallel.
46
New algorithm for determinacy-race detection
  • Similar to SP-Bags algorithm with two parts:
    • Check whether threads are logically parallel
    • Use shadow spaces to detect determinacy race
  • Use order maintenance data-structure to determine whether threads are parallel
  • Gain: Order maintenance operations are O(1) amortized time.
47
The algorithm
  • Maintain two orders:
    • Regular serial, depth-first, left-to-right execution.
      • At each spawn, follow spawn thread before continuation thread
    • Right-to-left execution.
      • At each spawn, follow continuation thread before spawn thread
  • Claim: Two threads e1 and e2 are parallel if and only if e1 precedes e2 in one order, and e2 precedes e1 in the other.
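The claim can be checked on a small example: a procedure spawns A, continues with B, syncs, then runs C. The left-to-right order is [A, B, C]; the right-to-left order is [B, A, C]; only the A/B pair flips. A sketch of the test (Python, illustrative — the real algorithm maintains the orders incrementally with an order-maintenance structure):

```python
def parallel(e1, e2, order_lr, order_rl):
    """Two threads are logically parallel iff the two serial orders
    disagree on their relative position."""
    lr = order_lr.index(e1) < order_lr.index(e2)
    rl = order_rl.index(e1) < order_rl.index(e2)
    return lr != rl
```

A and B flip between the orders, so they are parallel; C follows both in both orders, so it is serial after them.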
48
Depth-first, left-to-right execution
49
Right-to-left execution
50
How to maintain both orders in serial execution
  • Keep two pieces of state for each procedure F:
    • CF is current thread in F
    • SF is next sync statement in F.
  • In both orders, insert new threads after current thread
  • On spawn, insert continuation thread before spawn thread in one order. Do the opposite in the other.
  • On sync, advance CF to SF
  • In any other thread, update shadow spaces based on current thread
51
Example
52
Project Proposal:

Parallel Nondeterminator
  • He Yuxiong
53
Objective
  • I propose the Parallel Nondeterminator to:
  • Check for determinacy races in the parallel execution of programs written in languages like Cilk
  • Develop an efficient algorithm to decide the concurrency between threads
  • Develop an efficient algorithm to reduce the number of entries in the access history
54
Primitive Idea in Concurrency Test
  • (1) Labeling scheme
  • (2) Set operations
    • Thread representation: (fid, tid)
55
"Two sets:"
  • Two sets:
  • Parallel set PS(f) ={(pf, ptid) |  all threads with tid >= ptid in function pf is parallel with the running thread of f}
  • Children set CS(f)={fid | fid is the descendant of current function }


  • Operations:
  • Spawn: Thread Tx of function Fi spawns function Fj
  • Operations on child Fj
  • Operations on parent Fi



56
"Sync:"
  • Sync: Function Fi executes sync
  • PS(Fi)=PS(Fi)-CS(Fi)


  • Return: Function Fj returns to Function Fi
  • CS(Fi) = CS(Fi) + CS(Fj)
  • Release PS(Fj) and CS(Fj)


  • Concurrency Test:
  • Check if (fx, tx) is parallel with the current running thread (fc, tc):


57
Primitive Idea for Access History
  • Serial:
    • read(l), write(l)
  • Simplest parallel program, without nested parallelism:
    • two parallel read records, one write record
  • Language structure like Cilk:
    • read: max level of parallelism; one write record
  • Q: Is it possible to keep only two read records for each shared location in the Cilk Parallel Nondeterminator?
    • Keep two parallel read records with the highest-level LCA in the parent-child spawn tree.
58
"Thank you very much!"
  • Thank you very much!
59
Parallel Nondeterminator
Wang Junqing
60
Using Cilk
61
Accelerating Multiprocessor Simulation
  • 6.895 Project Proposal


  • Kenneth C. Barr
62
Introduction
  • Inspiration
    • UVSIM is slow
    • FlexSIM is cool
      • Interleave short periods of detailed simulation with long periods of functional warmup (e.g., cache and predictor updates, but not out-of-order logic)
      • Achieve high accuracy in fraction of the time
    • Multi-configuration simulation is cool
      • Recognize structural properties, e.g., “contents of a FA cache are a subset of all larger FA caches with the same line size,” so search small->large. Once we hit in the small cache, we can stop searching
      • Simulate many configurations with a single run
63
The Meta Directory
  • Combine previous ideas to speed multiprocessor simulation
  • Sequential consistency is what matters; perhaps detailed consistency mechanisms can be interleaved with a fast, functionally equivalent method




64
Meta-directory example
  • Detailed simulation:
    • P1 ld x ; null -> Sh, data sent to P1
    • P2 ld x ; Sh -> Sh, data sent to P2
    • P3 ld x ; Sh -> Sh, data sent to P3
    • P2 st x ; Sh -> Ex, inv sent to P1 and P3, data sent to P2
    • P1 st x ; Ex -> Ex, inv sent to P2, data sent to P1
    • P2 ld x ; Ex -> Sh, data sent to P2
  • Shortcut
    • Record access in a small “meta-directory:”
      x: {P1, 0, r}, {P2, 1, r}, {P3, 2, r}, {P2, 3, w}, {P1, 4, w}, {P2, 5, r}
    • All reads and writes occur in a memory; no messages sent or received, no directory modeled, no cache model in processor (?)
    • When it comes time for detailed simulation, we can reconstruct directory by scanning backwards: x is shared by P1 and P2.
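The backward scan follows directly from the record format above: the current sharers are the last writer plus every processor that has read the line since that write. A sketch (Python, illustrative only):

```python
def reconstruct_sharers(log):
    """Scan the meta-directory log backwards to recover the sharer set.
    log: list of (processor, time, op) with op in {'r', 'w'}."""
    sharers = set()
    for proc, _, op in reversed(log):
        sharers.add(proc)
        if op == 'w':
            break       # everything before the last write is stale
    return sharers
```

On the example log for x, the scan stops at P1's write and reports that x is shared by P1 and P2, matching the detailed simulation.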

65
Challenges
  • Accesses are stored in a circular queue. How many records are needed for each address?
  • What happens when a processor writes back data? The current scheme introduces false hits.
  • Does this scheme always work? Some proofs would be nice.
66
Methodology
  • Platform
    • I got Simics to boot SMP Linux
    • But Bochs is open-source
    • x86 is key if this is to be a long-term framework
    • Cilk benchmarks?
  • Tasks
    • Create detailed directory model
    • Create meta directory
    • Create a way to switch back and forth maintaining program correctness
    • Measure the dramatic improvement!
67
Conclusion
  • Reality
    • Sounds daunting for the December deadline, but if I can prove feasibility or find fatal flaws, I’d be excited
  • Suggestions?
68
Project Proposal: Parallelizing METIS
  • Zardosht Kasheff
69
What is METIS?
  • Graph Partitioning algorithm


  • Developed at University of Minnesota by George Karypis and Vipin Kumar


  • Information on METIS:
  • http://www-users.cs.umn.edu/~karypis/metis/
70
Stages of Algorithm: Coarsening
  • Task: Create successively smaller graphs that are good representations of the original graph by collapsing connected nodes.


  • Issues:
    • Minor concurrency issues.
    • Maintaining data locality.
    • Writing large amounts of data to memory in a scalable fashion.
71
Stages of Algorithm: Initial partitioning.
  • Task: partition small graph


  • Issues: none. The runtime of this portion of the algorithm is very small.
72
Stages of Algorithm: Uncoarsening and Refinement
  • Uncoarsening and Refinement: project partition of coarsest graph to original graph and refine partition.


  • Issues:
    • Major concurrency issues.
    • Remaining issues under research.
73
Parallel Sorting
Paul Youn
  • I propose to parallelize a sorting algorithm. Initially I am looking at quicksort, but I may investigate other sorting algorithms.
  • Sorting is a problem that can potentially be sped up significantly (e.g., mergesort).
  • As an alternative, I am considering other algorithms.
74
Goals
  • Similar to Lab 1 approach.
  • Speedy serial algorithm.
  • Basic parallel algorithm.
  • Runtime Analysis.
  • Empirical verification of runtime analysis.
  • Parallel speedups.
75
HELP!
  • Not sure about the scope of the problem.
  • Anyone out there interested in this stuff?
  • Other appropriate algorithms? Max-flow?
76
PROJECT PROPOSAL
SMA5509
Implement an FIR filter in parallel using Cilk
  • Name: Pham Duc Minh
  • Matric: HT030502H
  • Email: g0300231@nus.edu.sg, phamducm@comp.nus.edu.sg
77
Discrete Signal
  • the discrete input signal is x(n)
  • the output signal is y(n)
  • the impulse response is h(n)



78
FIR filter
  • Discrete-time LTI (linear time-invariant) systems can be classified into FIR and IIR.
  • An FIR filter has an impulse response h(n) that extends only over a finite time interval, say 0 ≤ n ≤ M, and is identically zero beyond that:
    {h0, h1, h2, . . . , hM, 0, 0, 0, . . .}
    M is referred to as the filter order.
  • The output y(n) of an FIR filter simplifies to the finite-sum form:
    y(n) = Σ_{m=0}^{M} h(m) x(n−m)
  • or, explicitly:
    y(n) = h0 x(n) + h1 x(n−1) + · · · + hM x(n−M)
79
Methods to process
    • 1. Block processing
      • a. Convolution:
        y(n) = Σ_{m=0}^{M} h(m) x(n−m)
80
b. Direct Form

The length of the input signal x(n) is Lx

The length of the output y(n) will be Ly = Lx + Lh − 1
81
c. Matrix form

y = H x, where H is built from shifted copies of the filter coefficients h(n),

with the dimension of H being Ly × Lx
82
"Overlap-Add Block Convolution Method"
  • Overlap-Add Block Convolution Method
  • above methods are not feasible in infinitive or extremely long applications
  • divide the long input into contiguous non-overlapping blocks of manageable length, say L samples
  • y0 =  h*x0
  • y1 =  h*x1
  • y2 =  h*x2
  • Thus the resulting block y0 starts at absolute time n = 0
  • block y1 starts at n = L, and so on.
  • The last M samples of each output block will overlap with
  • the first M outputs of the next block .
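The block decomposition can be verified against direct convolution in a few lines (Python, illustrative; the block length L is a free parameter):

```python
def conv(h, x):
    """Direct linear convolution, output length len(x) + len(h) - 1."""
    M = len(h) - 1
    return [sum(h[m] * x[n - m] for m in range(len(h)) if 0 <= n - m < len(x))
            for n in range(len(x) + M)]

def overlap_add(h, x, L):
    """Overlap-add: convolve x block by block; each block's last M outputs
    add into the first M outputs of the next block."""
    M = len(h) - 1
    y = [0] * (len(x) + M)
    for start in range(0, len(x), L):
        yb = conv(h, x[start:start + L])     # length block + M
        for i, v in enumerate(yb):
            y[start + i] += v                # tails overlap and add
    return y
```

Because convolution is linear, the summed block outputs equal the convolution of the whole signal, regardless of L.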
83
"2."
  • 2. Sampling Processing
  • The direct form I/O convolutional equation


  • We define the internal states w1(n), w2(2), w3(n)….
  • w0(n) = x(n)
  • w1(n) = x(n-1) = w0(n-1)
  • w2(n) = x(n-2) = w1(n-1)
  • …..
  • With this definition, we can y(n) in the form
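The internal states w_m(n) amount to a delay line that shifts once per input sample; a sample-processing sketch (Python, illustrative):

```python
def fir_stream(h, samples):
    """Sample-by-sample FIR using a delay line of internal states,
    w[m](n) = x(n - m); emits one output per input sample."""
    w = [0] * len(h)                 # w[0]=x(n), w[1]=x(n-1), ...
    out = []
    for xn in samples:
        w = [xn] + w[:-1]            # shift the delay line by one sample
        out.append(sum(hm * wm for hm, wm in zip(h, w)))
    return out
```

Its outputs match the first Lx samples of the block-convolution result for the same h and x.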
84
"Goal of project"
  • Goal of project
  • Use Cilk to implement FIR filter in parallel
  • Compare with the processes of DSP kit (pipeline) and the process of FPGA (Field Programmable Gate Array) (also parallel).


85
"How to implement"
  • How to implement
  • Multithreaded programming
  • Cilk compiler
86
Other things
  • If I finish early, I would welcome suggestions for additional work from the staff.
87
Cache-Oblivious Algorithms
88
SMA5509: Theory of Parallel Systems
  • Sriram Saroop
  • Advait D. Karande
89
bzip2 compression
  • Lossless text compression scheme


90
Burrows-Wheeler Transform (1994)
  • A pre-processing step for data compression
  • Involves sorting all rotations of the block of data to be compressed
  • Rationale: the transformed string compresses better
  • Worst-case complexity of N² log N, where N = size of the block to be sorted
  • Can be improved to N (log N)² (Sadakane ’98)
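The sort-of-all-rotations step can be sketched naively (Python, illustrative; real bzip2 uses far more engineered suffix sorting, and the '$' sentinel is an assumption of this sketch):

```python
def bwt(s):
    """Naive Burrows-Wheeler transform: append a sentinel, sort every
    rotation of the string, and read off the last column."""
    s = s + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)
```

The transformed string groups similar characters together, which is what makes the subsequent move-to-front and Huffman stages compress well.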
91
Why make it cache-oblivious?
  • Performance of BWT heavily dependent on cache behavior (Seward ’00)
  • Avoid slowdown for large files with a high degree of repetitiveness
  • Especially useful in applications like bzip2 that are required to perform well on different memory hierarchies


92
Project Plan
  • Design and implementation of a cache-oblivious algorithm for BWT
  • Performance comparisons with standard bzip2 for available benchmarks
  • Back-up: Optimizations for BWT including parallelization



93
New Cache-Oblivious Algorithms
Zhang Jiahui and Neel Kamal
94
Cache-Oblivious Lock Free Algorithms
  • Seth Gilbert
95
Cache-Oblivious Algorithms
  • Optimal memory access
    • Matrix multiply, Fast Fourier Transform
    • B-trees, Priority Queues


  • Concurrent?


  • Lock-free?
    • At least one thread can always make progress.

96
Packed Memory Structure
  • Operations:
    • Traverse(k)  --  O(k/B)
    • Insert  --  O((log² n)/B)
    • Delete  --  O((log² n)/B)
  • Goal:
    • Design a lock-free version of the cache-oblivious algorithm


97
B-tree
  • Two algorithms
    • Based on packed memory structure
    • Based on alternative recursive construction and pointer-swinging