6.895 Project Presentations
October 15-17, 2003

Improving Cilk

Adaptively Parallel Processor Allocation for Cilk Jobs
Kunal Agrawal
Siddhartha Sen

Goal
Design and implement a dynamic processor-allocation system for adaptively parallel jobs (jobs for which the number of processors that can be used without waste varies during execution)
The problem of allocating processors to adaptively parallel jobs is called the adaptively parallel processor-allocation problem [2].
At any given time, each job j has a desire d_j (the number of processors it wants) and an allotment m_j (the number it is given)

Goal (cont.)
We want to design a processor-allocation system that achieves a “fair” and “efficient” allocation among all jobs.
“fair” means that whenever a job receives fewer processors than it desires, no other job receives more than one more processor than this job received [2]
“efficient” means that no job receives more processors than it desires [2]
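A small worked example (ours, for illustration): with P = 8 processors and desires d_1 = 5, d_2 = 1, d_3 = 4, the allotment m_1 = 4, m_2 = 1, m_3 = 3 is efficient (no job gets more than it desires) and fair (jobs 1 and 3 are deprived, and no job receives more than one processor more than either of them).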

Illustration of Problem

Main Algorithmic Questions
Dynamically determine the desires of each job in the system
Heuristics on the steal rate (see the sketch after this list)
Heuristics on number of visible stack frames
Heuristics on number of threads in ready deques
Dynamically determine the allotment for each job such that the resulting allocation is fair and efficient
SRLBA algorithm [2]
Cilk macroscheduler algorithm [3]
Lottery scheduling techniques [5]
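As a concrete illustration of the first question, here is a hypothetical sketch of a steal-rate heuristic; all names, fields, and thresholds below are invented, not part of any existing system:

typedef struct {
    int desire;     /* processors this job wants (d_j)     */
    int allotment;  /* processors it currently has (m_j)   */
    int steals;     /* steal attempts in the last interval */
} job_t;

/* Frequent steals suggest workers cannot find work, so shrink the
   desire; no steals suggest every worker stayed busy, so grow it. */
void update_desire(job_t *j)
{
    if (j->steals > 2 * j->allotment)        /* invented threshold */
        j->desire = j->allotment > 1 ? j->allotment - 1 : 1;
    else if (j->steals == 0)
        j->desire = j->allotment + 1;
    j->steals = 0;                           /* reset the interval */
}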

Assumptions
All jobs on the system are Cilk jobs
Jobs can enter and leave the system and change their parallelism during execution
All jobs are mutually trusting (they will stay within the bounds of their allotments and communicate their desires honestly)
Each job has at least one processor to start with

References
[1] Supercomputing Technologies Group. Cilk 5.3.2 Reference Manual. MIT Laboratory for Computer Science, November 2001.
[2] B. Song. Scheduling adaptively parallel jobs. Master's thesis, Massachusetts Institute of Technology, January 1998.
[3] R. D. Blumofe, C. E. Leiserson, and B. Song. Automatic processor allocation for work-stealing jobs.
[4] C. A. Waldspurger. Lottery and Stride Scheduling: Flexible Proportional-Share Resource Management. PhD thesis, Massachusetts Institute of Technology, September 1995.
[5] C. A. Waldspurger and W. E. Weihl. Lottery scheduling: Flexible proportional-share resource management. In Proceedings of the First Symposium on Operating Systems Design and Implementation, pages 1-11. USENIX, November 1994.
[6] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: An efficient multithreaded runtime system. In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), pages 207-216, Santa Barbara, California, July 1995.

Fast Serial-Append for Cilk
6.895 – Theory of Parallel Systems
Project Presentation (Proposal)
Alexandru Caracas

Serial-Append (cont'd)
A Cilk dag
Numbers represent the serial execution of threads
For example, suppose all threads perform file I/O operations

Serial-Append (cont'd)
Same Cilk dag
Numbers represent the parallel execution of threads
Goal:
We would like to execute the threads in parallel while still allowing for the serial order to be reconstructed.

Serial-Append (cont'd)
Partitioning of the Cilk dag
Colors represent the I/O operations done by a processor between two steal operations.

Serial-Append
The Serial-Append of the computation is obtained by reading the output in the order:
orange (I)
green (II)
dark-red (III)
purple (IV)
red (V)

Project Goals
Efficient serial-append algorithm
possibly using B+-trees
Analyze possible implementations:
kernel level
formatted file
Cilk file system
Implementation (one of the above)
Port Cheerio to Cilk 5.4.
Comparison with own implementation
Port serial applications to use serial-append (analyze their performance)

References
Robert D. Blumofe and Charles E. Leiserson. Scheduling multithreaded computations by work stealing. In Proceedings of the 35th Annual Symposium on Foundations of Computer Science, pages 356-368, Santa Fe, New Mexico, November 1994.
Matthew S. DeBergalis. A parallel file I/O API for Cilk. Master's thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, June 2000.
Supercomputing Technologies Group, MIT Laboratory for Computer Science. Cilk 5.3.2 Reference Manual, November 2001. Available at http://supertech.lcs.mit.edu/cilk/manual-5.3.2.pdf.

A Space-Efficient Global Scheduler for Cilk
Jason Hickey
Tyeler Quentmeyer

Idea
Implement a new scheduler for Cilk based on Narlikar and Blelloch's paper "Space-Efficient Scheduling of Nested Parallelism"

AsyncDF Scheduler
Idea: minimize peak memory usage by preempting threads that try to allocate large blocks of memory
Algorithm: Roughly, AsyncDF runs the P left-most threads in the computation DAG

Example

Comparison to Cilk
Cilk
Distributed
Run threads like a serial program until a processor runs out of work and has to steal
Space: p·S1, where S1 is the space of the serial execution
AsyncDF
Global
Preempt a thread when it tries to allocate a large block of memory and replace it with a computationally heavy thread
Space: S1 + O(K·D·p), where K is the memory threshold and D is the depth of the DAG

AsyncDF Optimizations
Two versions of scheduler
Serial
Parallel
Running the P left-most threads does not exploit the locality of a serial program
Solution: Look at the cP left-most threads and “group” threads together for locality

Project Plan
Implement serial version of AsyncDF
Experiment with algorithms to group threads for locality
Performance comparison to Cilk
Different values of AsyncDF’s tunable parameter
Case studies
Recursive matrix multiplication
Strassen multiplication
N-body problem
Implement parallel version of AsyncDF
Performance comparison to Cilk
Additional experiments/modifications as research suggests

Questions?

Automatic conversion of NSP DAGs to SP DAGs
Sajindra Jayasena

"Objectives :"
Objectives
Design of an efficient algorithm to convert an arbitrary NSP (non-series-parallel) DAG into an SP (series-parallel) DAG
- correctness-preserving
Identifying different graph topologies
Analysis of space and time complexities
Bound on the increase in critical-path length
Perform empirical analysis using Cilk

Transactional Cilk

Language-Level Complex Transactions
 C. Scott Ananian

An Evaluation of Nested Transactions
Sean Lie
sean_lie@mit.edu
6.895 Theory of Parallel Systems
Wednesday, October 15, 2003

Transactional Memory
The programmer is given the ability to define atomic regions of arbitrary size.
Start_Transaction;
Flow[i] = Flow[i] – X;
Flow[j] = Flow[j] + X;
End_Transaction;
With hardware support, different transactions can be executed concurrently when there are no memory conflicts.

Nested Transactions
Nested transactions can be handled by simply merging all inner transactions into the outermost transaction.
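For illustration, in the Start_Transaction notation of the earlier slide (this example is ours, not from the talk):

Start_Transaction;           /* outer transaction begins                 */
A = A - X;
Start_Transaction;           /* inner transaction: merged into the outer */
B = B + X;
End_Transaction;             /* no effect: nothing commits yet           */
C = C - X;
End_Transaction;             /* everything commits (or aborts) together  */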

Concurrency
Merging transactions is simple but may decrease the program concurrency.

Nested Concurrent Transactions
Nested Concurrent Transactions (NCT):
Allow the inner transaction to run and complete independently from the outer transaction.
Questions:
When is NCT actually useful?
How much overhead is incurred by NCT?
How much do we gain from using NCT?
The tool needed to get the answers:
Transactional memory hardware simulator – UVSIM
UVSIM currently supports merging nested transactions.
How to get the answers:
Identify applications where NCT is useful. Look at malloc?
Implement necessary infrastructure to run NCTs in UVSIM.
Evaluate performance of NCT vs. merging transactions.

Compiler Support for Atomic Transactions in Cilk
Jim Sukha
Tushara Karunaratna

Determinacy Race Example

New Keywords in Cilk

Simulation of Transactional Memory

Work Involved
Modify the Cilk parser to accept xbegin and xend as keywords.
Identify all load and store operations in user code, and replace them with calls to functions from the runtime system.
Implement the runtime system. An initial implementation is to divide memory into blocks and to use a hash table to store a lock and backup values for each block (a sketch of one possible interface follows).
Experiment with different runtime implementations.
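A sketch of one possible runtime interface and the rewriting it implies; the names tx_begin, tx_load, tx_store, and tx_end are hypothetical, not an existing API:

/* One lock and a set of backup values per memory block, found via a hash table. */
void tx_begin(void);
long tx_load(long *addr);               /* lock the block, log a backup, read  */
void tx_store(long *addr, long value);  /* lock the block, log a backup, write */
void tx_end(void);                      /* commit: drop backups, release locks */

/* The compiler would rewrite the body of
       xbegin { Flow[i] = Flow[i] - X; } xend
   roughly as follows (Flow, i, X from the earlier example): */
void example(long *Flow, int i, long X)
{
    tx_begin();
    tx_store(&Flow[i], tx_load(&Flow[i]) - X);
    tx_end();
}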

Data Race in Transactional Cilk
Xie Yong

Background
Current Cilk: achieves atomicity using locks
problems such as priority inversion, deadlock, etc.
nontrivial to code
Transactional memory
Software TM: high overhead -> slow
Hardware TM

Cilk's transactions everywhere
Every instruction becomes part of a transaction
Based on HTM
Cilk transactions:
cut the Cilk program into atomic pieces
based on some language construct, or generated automatically by the compiler

Data Race
Very different from the traditional definition in the Nondeterminator
Assumption (correct parallelization)
Definition (first half of Kai's master's thesis)
Efficient detection (vs. the NP-completeness proved by Kai)
Make use of current algorithms/tools

Non-Determinacy Detection

Linear-time determinacy race detection
Jeremy Fineman

Motivation
Nondeterminator has two parts:
Check whether threads are logically parallel
Use shadow spaces to detect determinacy race
SP-Bags algorithm uses LCA lookup to determine whether two threads are parallel
LCA lookup with a disjoint-sets data structure takes amortized O(α(v,v)) time.
We do not care about the LCA. We just want to know if two threads are logically parallel.

New algorithm for determinacy-race detection
Similar to SP-Bags algorithm with two parts:
Check whether threads are logically parallel
Use shadow spaces to detect determinacy race
Use an order-maintenance data structure to determine whether threads are parallel
Gain: order-maintenance operations take O(1) amortized time.

The algorithm
Maintain two orders:
Regular serial, depth-first, left-to-right execution:
At each spawn, follow the spawned thread before the continuation thread
Right-to-left execution:
At each spawn, follow the continuation thread before the spawned thread
Claim: two threads e1 and e2 are parallel if and only if e1 precedes e2 in one order, and e2 precedes e1 in the other.

Depth-first, left-to-right execution

Right-to-left execution

How to maintain both orders in serial execution
Keep two pieces of state for each procedure F:
C_F is the current thread in F
S_F is the next sync statement in F
In both orders, insert new threads after the current thread
On spawn, insert the continuation thread before the spawned thread in one order; do the opposite in the other (see the sketch below)
On sync, advance C_F to S_F
In any other thread, update the shadow spaces based on the current thread
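A minimal sketch of this bookkeeping, assuming a hypothetical order-maintenance interface (om_insert_after, om_precedes); all names here are ours, not the algorithm's actual API:

typedef struct om om_t;          /* order-maintenance list      */
typedef struct thread thread_t;  /* a thread in the computation */
extern void om_insert_after(om_t *o, thread_t *a, thread_t *b);
extern int  om_precedes(om_t *o, thread_t *a, thread_t *b);

/* eng = left-to-right order; heb = right-to-left order */
void on_spawn(om_t *eng, om_t *heb, thread_t *CF,
              thread_t *child, thread_t *cont)
{
    om_insert_after(eng, CF, child);   /* eng: CF, child, cont */
    om_insert_after(eng, child, cont);
    om_insert_after(heb, CF, cont);    /* heb: CF, cont, child */
    om_insert_after(heb, cont, child);
}

/* e1 and e2 are logically parallel iff the two orders disagree. */
int parallel(om_t *eng, om_t *heb, thread_t *e1, thread_t *e2)
{
    return om_precedes(eng, e1, e2) != om_precedes(heb, e1, e2);
}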

Example

Project Proposal:

Parallel Nondeterminator
He Yuxiong

Objective
I propose a Parallel Nondeterminator to:
check for determinacy races during the parallel execution of programs written in a language like Cilk
develop an efficient algorithm to decide concurrency between threads
develop an efficient algorithm to reduce the number of entries in the access history

Basic Idea for the Concurrency Test
(1) Labeling scheme
(2) Set operations
   Thread representation: (fid, tid)

"Two sets:"
Two sets:
Parallel set PS(f) = {(pf, ptid) | all threads with tid >= ptid in function pf are parallel with the running thread of f}
Children set CS(f) = {fid | fid is a descendant of the current function}
Operations:
Spawn: thread Tx of function Fi spawns function Fj
Operations on child Fj
Operations on parent Fi

"Sync:"
Sync: function Fi executes a sync
PS(Fi) = PS(Fi) - CS(Fi)
Return: function Fj returns to function Fi
CS(Fi) = CS(Fi) + CS(Fj)
Release PS(Fj) and CS(Fj)
Concurrency test:
Check whether (fx, tx) is parallel with the current running thread (fc, tc); by the definition of PS above, this holds iff PS(fc) contains a pair (fx, ptid) with tx >= ptid.

Basic Idea for the Access History
Serial:
read(l), write(l)
Simplest parallel program, without nested parallelism:
two parallel read records, one write record
A language structure like Cilk:
reads: up to the maximum level of parallelism; one write record
Q: Is it possible to keep only two read records for each shared location in the Cilk Parallel Nondeterminator?
Keep the two parallel read records with the highest-level LCA in the parent-child spawn tree.

"Thank you very much!"
Thank you very much!

Parallel Nondeterminator
Wang Junqing

Using Cilk

Accelerating Multiprocessor Simulation
6.895 Project Proposal
Kenneth C. Barr

Introduction
Inspiration
UVSIM is slow
FlexSIM is cool
Interleave short periods of detailed simulation with long periods of functional warmup (e.g., cache and predictor updates, but no out-of-order logic)
Achieve high accuracy in fraction of the time
Multi-configuration simulation is cool
Recognize structural properties, e.g., "the contents of a FA cache are a subset of all larger FA caches with the same line size," so search from small to large; once we hit in a small cache, we can stop searching
Simulate many configurations with a single run

The Meta Directory
Combine the previous ideas to speed up multiprocessor simulation
Sequential consistency is what matters; perhaps detailed consistency mechanisms can be interleaved with a fast, functionally equivalent method

Meta-directory example
Detailed simulation:
P1 ld x ;null -> Sh, data sent to P1
P2 ld x ;Sh -> Sh, data sent to P2
P3 ld x ;Sh -> Sh, data sent to P3
P2 st x ;Sh -> Ex, inv sent to P1 and P3, data sent to P2
P1 st x ;Ex -> Ex, inv sent to P2, data sent to P1
P2 ld x ;Ex -> Sh, data sent to P2
Shortcut
Record accesses in a small "meta-directory":
x: {P1, 0, r}, {P2, 1, r}, {P3, 2, r}, {P2, 3, w}, {P1, 4, w}, {P2, 5, r}
All reads and writes occur in a memory; no messages are sent or received, no directory is modeled, no cache model in the processor (?)
When it comes time for detailed simulation, we can reconstruct the directory by scanning backwards: x is shared by P1 and P2 (a sketch follows).
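A hypothetical sketch of the record format and the backward scan; all names are invented:

typedef struct {
    int  proc;    /* processor id                  */
    long time;    /* global access sequence number */
    char rw;      /* 'r' or 'w'                    */
} meta_rec;

/* Reconstruct the sharers of one address from its time-ordered log:
   readers after the most recent write are sharers, as is the writer. */
int sharers(const meta_rec *log, int n, int *out)
{
    int i, k = 0;
    for (i = n - 1; i >= 0; i--) {
        out[k++] = log[i].proc;
        if (log[i].rw == 'w')      /* the last write ends the scan */
            break;
    }
    return k;                      /* number of sharers found      */
}

/* On the log for x above, the scan returns {P2, P1}, matching
   "x is shared by P1 and P2". */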

Challenges
Accesses are stored in a circular queue. How many records are needed for each address?
What happens when a processor writes back data? The current scheme introduces false hits.
Does this scheme always work? Some proofs would be nice.

Methodology
Platform
I got Simics to boot SMP Linux
But Bochs is open-source
x86 is key if this is to be a long-term framework
Cilk benchmarks?
Tasks
Create detailed directory model
Create meta directory
Create a way to switch back and forth while maintaining program correctness
Measure the dramatic improvement!

Conclusion
Reality
Sounds daunting for the December deadline, but if I can prove feasibility or find fatal flaws, I'd be excited
Suggestions?

Project Proposal: Parallelizing METIS
Zardosht Kasheff

What is METIS?
Graph Partitioning algorithm
Developed at University of Minnesota by George Karypis and Vipin Kumar
Information on METIS:
http://www-users.cs.umn.edu/~karypis/metis/

Stages of Algorithm: Coarsening
Task: Create successively smaller graphs that are good representations of the original graph by collapsing connected nodes.
Issues:
Minor concurrency issues.
Maintaining data locality.
Writing large amounts of data to memory in a scalable fashion.

Stages of Algorithm: Initial Partitioning
Task: partition the small graph
Issues: none; the runtime of this portion of the algorithm is very small.

Stages of Algorithm: Uncoarsening and Refinement
Uncoarsening and Refinement: project the partition of the coarsest graph back onto the original graph and refine the partition.
Issues:
Major concurrency issues.
Remaining issues under research.

Parallel Sorting
Paul Youn
I propose to parallelize a sorting algorithm. Initially I am looking at quicksort, but I may investigate other sorting algorithms.
Sorting is a problem that can potentially be sped up significantly (e.g., mergesort).
As an alternative, I am considering other algorithms (a Cilk sketch of the obvious quicksort parallelization follows).
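A minimal Cilk-5 sketch of the obvious parallelization; the serial partition() helper (which returns the pivot's final index) is assumed, not shown:

cilk void pqsort(int *A, int n)
{
    int p;
    if (n <= 1)
        return;
    p = partition(A, n);                 /* serial partition step */
    spawn pqsort(A, p);                  /* sort both sides in    */
    spawn pqsort(A + p + 1, n - p - 1);  /* parallel              */
    sync;
}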

Goals
Similar to Lab 1 approach.
Speedy serial algorithm.
Basic parallel algorithm.
Runtime Analysis.
Empirical verification of runtime analysis.
Parallel speedups.

HELP!
Not sure about the scope of the problem.
Anyone out there interested in this stuff?
Other appropriate algorithms? Max-flow?

PROJECT PROPOSAL
SMA5509
Implement an FIR filter in parallel using Cilk
Name: Pham Duc Minh
Matric: HT030502H
Email: g0300231@nus.edu.sg
           phamducm@comp.nus.edu.sg

Discrete Signal
the discrete input signal is x(n)
the output signal is y(n)
the impulse response is h(n)

FIR filter
Discrete-time LTI (linear time-invariant) systems can be classified into FIR and IIR.
An FIR filter has an impulse response h(n) that extends only over a finite time interval, say 0 ≤ n ≤ M, and is identically zero beyond that:
   {h0, h1, h2, . . . , hM, 0, 0, 0, . . .}
M is referred to as the filter order.
The output y(n) of an FIR filter simplifies to the finite-sum form
   y(n) = Σ h(m) x(n−m),  summed over m = 0, . . . , M
or, explicitly,
   y(n) = h0 x(n) + h1 x(n−1) + · · · + hM x(n−M)

Methods to process
1. Block processing
a. Convolution

b. Direct Form

The length of the input signal x(n) is Lx, and the length of h(n) is Lh = M + 1.
The length of the output y(n) will be Ly = Lx + Lh − 1.

c. Matrix form

y = H x, where the convolution matrix H is built from the samples of h(n) and has dimension Ly × Lx.

"Overlap-Add Block Convolution Method"
Overlap-Add Block Convolution Method
The above methods are not feasible for infinite or extremely long inputs
Divide the long input into contiguous, non-overlapping blocks of manageable length, say L samples:
y0 =  h*x0
y1 =  h*x1
y2 =  h*x2
Thus the resulting block y0 starts at absolute time n = 0, block y1 starts at n = L, and so on.
The last M samples of each output block overlap with the first M outputs of the next block (a Cilk sketch follows).
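A hedged Cilk-5 sketch of overlap-add, convolving the blocks in parallel; all names are ours, and a real implementation would tune the block length L:

#include <stdlib.h>

/* Serial convolution of one block: writes y[0 .. len+M-1]. */
cilk void conv_block(const double *h, int M,
                     const double *x, int len, double *y)
{
    int n, m;
    for (n = 0; n < len + M; n++) {
        double acc = 0.0;
        for (m = 0; m <= M; m++)
            if (n - m >= 0 && n - m < len)
                acc += h[m] * x[n - m];
        y[n] = acc;
    }
}

/* y (length Lx + M) = h * x, computed block by block. */
cilk void fir_overlap_add(const double *h, int M,
                          const double *x, int Lx, int L, double *y)
{
    int nb = (Lx + L - 1) / L;                    /* number of blocks */
    double *tmp = calloc(nb * (L + M), sizeof(double));
    int i, n;

    for (i = 0; i < nb; i++) {
        int len = (i == nb - 1) ? Lx - i * L : L; /* last block may be short */
        spawn conv_block(h, M, x + i * L, len, tmp + i * (L + M));
    }
    sync;

    for (n = 0; n < Lx + M; n++)
        y[n] = 0.0;
    for (i = 0; i < nb; i++)                      /* overlap-add the blocks */
        for (n = 0; n < L + M; n++)
            if (i * L + n < Lx + M)
                y[i * L + n] += tmp[i * (L + M) + n];
    free(tmp);
}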

"2."
2. Sample processing
The direct-form I/O convolution equation
We define the internal states w0(n), w1(n), w2(n), . . .
w0(n) = x(n)
w1(n) = x(n−1) = w0(n−1)
w2(n) = x(n−2) = w1(n−1)
. . .
With this definition, we can write y(n) in the form
   y(n) = h0 w0(n) + h1 w1(n) + · · · + hM wM(n)
(a C sketch follows)
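A minimal C sketch of direct-form sample processing using the internal states above; the function name is invented:

/* Process one input sample x; w[0..M] is the delay line, zeroed
   before the first call. */
double fir_sample(const double *h, double *w, int M, double x)
{
    double y = 0.0;
    int m;
    w[0] = x;                      /* w0(n) = x(n)                     */
    for (m = 0; m <= M; m++)
        y += h[m] * w[m];          /* y(n) = h0 w0(n) + ... + hM wM(n) */
    for (m = M; m >= 1; m--)
        w[m] = w[m - 1];           /* shift: wm(n+1) = w(m-1)(n)       */
    return y;
}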

"Goal of project"
Goal of project
Use Cilk to implement an FIR filter in parallel
Compare with a DSP-kit implementation (pipelined) and an FPGA (Field-Programmable Gate Array) implementation (also parallel).

"How to implement"
How to implement
Multithreaded programming
Cilk compiler

Other things
If I finish early, I would welcome suggestions from the staff for additional work.

Cache-Oblivious Algorithms

SMA5509: Theory of Parallel Systems
Sriram Saroop
Advait D. Karande

bzip2 compression
Lossless text compression scheme

Burrows-Wheeler Transform (1994)
A pre-processing step for data compression
Involves sorting all rotations of the block of data to be compressed
Rationale: the transformed string compresses better
Worst-case complexity of O(N^2 log N), where N = size of the block to be sorted
Can be improved to O(N (log N)^2) (Sadakane '98); a naive sketch of the rotation sort follows
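A naive sketch of the rotation sort at the heart of BWT, written by us for illustration; the character-by-character comparator gives the O(N^2 log N) worst case quoted above:

#include <stdlib.h>

static const unsigned char *g_buf;   /* block being transformed */
static int g_n;

static int rot_cmp(const void *a, const void *b)
{
    int i = *(const int *)a, j = *(const int *)b, k;
    for (k = 0; k < g_n; k++) {      /* compare two rotations */
        unsigned char ci = g_buf[(i + k) % g_n];
        unsigned char cj = g_buf[(j + k) % g_n];
        if (ci != cj)
            return ci - cj;
    }
    return 0;
}

/* out[k] = last character of the k-th sorted rotation of buf. */
void bwt(const unsigned char *buf, int n, unsigned char *out)
{
    int *rot = malloc(n * sizeof(int)), k;
    for (k = 0; k < n; k++)
        rot[k] = k;
    g_buf = buf;
    g_n = n;
    qsort(rot, n, sizeof(int), rot_cmp);
    for (k = 0; k < n; k++)
        out[k] = buf[(rot[k] + n - 1) % n];
    free(rot);
}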

Why make it cache-oblivious?
Performance of BWT heavily dependent on cache behavior (Seward ’00)
Avoid slowdown for large files with a high degree of repetitiveness
Especially useful in applications like bzip2 that are required to perform well on different memory hierarchies

Project Plan
Design and implementation of a cache-oblivious algorithm for BWT
Performance comparisons with standard bzip2 on available benchmarks
Backup: optimizations for BWT, including parallelization

New Cache-Oblivious Algorithms
Zhang Jiahui and Neel Kamal

Cache-Oblivious Lock Free Algorithms
Seth Gilbert

Cache-Oblivious Algorithms
Optimal memory access
Matrix multiply, Fast Fourier Transform
B-trees, Priority Queues
Concurrent?
Lock-free?
At least one thread can always make progress.

Packed Memory Structure
Operations:
Traverse(k)  --  O(k/B)
Insert  --  O((log^2 n)/B)
Delete  --  O((log^2 n)/B)
Goal:
Design a lock-free version of the cache-oblivious algorithm

B-tree
Two algorithms
Based on packed memory structure
Based on alternative recursive construction and pointer-swinging