# LECTURE 17 COARSE-GRAINED RECONFIGURABLE COMPUTING

JOEL EMER AND DANIEL SANCHEZ

6.888 PARALLEL AND HETEROGENEOUS COMPUTER ARCHITECTURE Spring 2013





## Dyser



## Coarse-grained Block Diagram



# Performance/Area



# Sorting Network



## Merge Sort

```
if (listA > listB ||
                             (listA.finished && !listB.finished))
                            send(listB);
for x = 1..NPASSES
                       else if (!listA.finished)
    for y = 1..k
                            send(listA);
     // control loop
                                22
  PE
                                              14
               32 83
                        PE
                                       PE
                                                               PE
                         \bigcirc
                                                                \circ
                         9
```

## Merge sort worker – PC+RegQueue

```
check_a: beqz
              %in0.notEmpty, check_a // listA
check_b: begz
              %in1.notEmpty, check_b // listB
              %out0.notFull, check_o // outList
check_o: beqz
        beq
              %in0.tag, EOL, a_done
               %in1.tag, EOL send_a
        cmp.lt %r0, %in0.first, %in1.first
               %r0, send_a
        bnez
               %out0, %in1.first
send_b:
        enq
               %in1
        deq
               check_a
        jump
               %out0, %in0.first
send_a:
        enq
               %in0
        deq
               check_a
        jump
               %in1.first, EOL, done
a_done:
        beq
               send b
        jump
        deq
               %in0
done:
               %in1
        deq
        return;
```

| Static Insts           | 18 |
|------------------------|----|
| Avg Insts/Iteration    | 10 |
| Avg Branches/Iteration | 7  |

# PC-based design alternatives

| Feature       | Description                                                                |
|---------------|----------------------------------------------------------------------------|
| PC (Baseline) | PEs use program counters, communicate using shared-memory queues.          |
| +RegQueue     | Expose register-mapped queues to ISA, test via active polling.             |
| +FusedDeq     | Destructive read of queue registers without separate instructions.         |
| +RegQSelect   | Allow indirect jump based on register queue status bits.                   |
| +RegQStall    | Issue stalls on queue input/output registers without special instructions. |
| +QMultiThread | Stalling on empty/full queue yields thread.                                |
| +Predication  | Predicate registers that can be set using queue status bits.               |
| +Augmented    | ISA augmented with all of the above features except $+$ QMultiThread.      |

## Merge sort worker – PC+Augmented

```
%in0.tag, EOL, a_done
start:
             beq
             beq
                    %in1.tag, EOL, send_a
             cmp.ge p2, in0.first, in1.first
send_b: (p2) enq %out0, %in1.first (deq %in1)
send_a: (!p2)
             enq %out0, in0.first (deq %in0)
             jump
                   start
             cmp.ne p2, %in1.first, EOL
a_done:
        (p2) jump send_b
                    (deq %in0, deq %in1)
             nop
             return;
```

| Static Insts                    | 9    |
|---------------------------------|------|
| Avg Insts/Iteration (Issued)    | 6    |
| Avg Insts/Iteration (Committed) | 5    |
| Avg Branches/Iteration          | 3    |
| Speedup vs PC+RegQueue (Fig 3)  | 1.4× |

#### **Guarded Actions**

reg B; reg C;

```
rule X (A > 0 && B != C)
{
    A <= B + 1;
    B <= B - 1;
    C <= B * A;
}</pre>
```

rule Y (...) {...}

rule Z (...) {...}

Scheduler

- Program consists of rules that may perform computations and read/write state
- Each rule specifies conditions (guard) under which it is allowed to fire
- Separates description and execution of data (rule body) from control (guards)
- A scheduler is generated (or provided by hardware) that evaluates the guards and schedules rule execution
- Sources of Parallelism
  - Intra-Rule parallelism
  - Inter-Rule parallelism
  - Scheduler overlap with Rule execution
  - Parallel access to state

## Merge sort worker – Trigger Inst

```
doCheck:
   when (!p0 && %in0.tag != EOL
              && %in1.tag != EOL) do
        cmp.ge p1, %in0.data, %in1.data (p0 := 1)
sendA:
   when (p0 && p1) do
        enq %out0, %in0.data (deq %in0, p0 := 0)
sendB:
   when (p0 && !p1) do
        enq %out0, %in1.data (deq %in1, p0 := 0)
drainA:
   when (%in0.tag != EOL && %in1.tag == EOL) do
        enq %out0, %in0.data (deq %in0)
drainB:
   when (%in0.tag == EOL && %in1.tag != EOL) do
        enq %out0, %in1.data (deq %in1)
bothDone:
   when (%in0.tag == EOL && %in1.tag == EOL) do
        nop (deq %in0, deq %in1)
```

| Static Insts                    | 6          |
|---------------------------------|------------|
| Avg Insts/Iteration             | 2          |
| Speedup vs PC+RegQueue (Fig 3)  | $5 \times$ |
| Speedup vs PC+Augmented (Fig 4) | $3 \times$ |

## PE Block Diagram



## Scheduler



#### Static Instructions



## **Dynamic Instructions**



### Control Instructions Evaluation

