

# Multipliers & Pipelining

- Combinational multiplier
- Two's complement multiplier
- Smaller multipliers, faster multipliers
- Latency & Throughput
- Pipelining to increase throughput
- Retiming

#### Lab #3 due tonight, report next Tuesday, no LPSets this week

#### **Unsigned Multiplication**



Multiplying N-bit number by M-bit number gives (N+M)-bit result

Easy part: forming partial products (just an AND gate since  $B_I$  is either 0 or 1) Hard part: adding M N-bit partial products

## **Combinational Multiplier (unsigned)**



### **Combinational Multiplier (signed!)**



6.111 Fall 2008

#### 2's Complement Multiplication (Baugh-Wooley)

Step 1: two's complement operands so high order bit is  $-2^{N-1}$ . Must sign extend partial products and subtract the last one

|             |                              |                              |                              | ,                            | X3<br>* Y3                                | X2<br>Y2             | X1<br>Y1     | X0<br>Y0 |
|-------------|------------------------------|------------------------------|------------------------------|------------------------------|-------------------------------------------|----------------------|--------------|----------|
| +<br>+<br>- | X3Y0<br>X3Y1<br>X3Y2<br>X3Y3 | X3Y0<br>X3Y1<br>X3Y2<br>X3Y3 | X3Y0<br>X3Y1<br>X3Y2<br>X2Y3 | X3Y0<br>X3Y1<br>X2Y2<br>X1Y3 | <mark>X3Y0</mark><br>X2Y1<br>X1Y2<br>X0Y3 | X2Y0<br>X1Y1<br>X0Y2 | X1Y0<br>X0Y1 | X0Y0     |
|             | <br>z7                       | <br>Z6                       | <br>z5                       | <br>Z4                       | <br>Z3                                    | <br>Z2               | <br>z1       | <br>z0   |

Step 2: don't want all those extra additions, so add a carefully chosen constant, remembering to subtract it at the end. Convert subtraction into add of (complement + 1).



Step 3: add the ones to the partial products and propagate the carries. All the sign extension bits go away!

|   |      |              |             | <b>X3Y</b> 0 | X2Y0 | X1Y0 | X0Y0 |
|---|------|--------------|-------------|--------------|------|------|------|
| + |      |              | X3Y1        | X2Y1         | X1Y1 | X0Y1 |      |
| + |      | <b>X2Y2</b>  | <u>x1Y2</u> | <u>x0y2</u>  |      |      |      |
| + | X3X3 | <u>x2</u> y3 | <b>X1Y3</b> | <b>X0Y3</b>  |      |      |      |
| + |      |              |             | 1            |      |      |      |
| - | 1    | 1            | 1           | 1            |      |      |      |

Step 4: finish computing the constants...

|   |   |      |             |             | <b>X3Y</b> 0 | X2Y0 | X1Y0 | X0Y0 |
|---|---|------|-------------|-------------|--------------|------|------|------|
| + |   |      |             | X3Y1        | X2Y1         | X1Y1 | X0Y1 |      |
| + |   |      | <u>x2Y2</u> | <u>x1Y2</u> | <u>x0y2</u>  |      |      |      |
| + |   | X3Y3 | X2Y3        | X1Y3        | X0Y3         |      |      |      |
| + | 1 |      |             | 1           |              |      |      |      |

Result: multiplying 2's complement operands takes just about same amount of hardware as multiplying unsigned operands!

#### 2's Complement Multiplication



## **Multiplication in Verilog**

You can use the "\*" operator to multiply two numbers:

```
wire [9:0] a,b;
wire [19:0] result = a*b; // unsigned multiplication!
```

If you want Verilog to treat your operands as signed two's complement numbers, add the keyword signed to your wire or reg declaration:

```
wire signed [9:0] a,b;
wire signed [19:0] result = a*b; // signed multiplication!
```

Remember: unlike addition and subtraction, you need different circuitry if your multiplication operands are signed vs. unsigned. Same is true of the >>> (arithmetic right shift) operator. To get signed operations all operands must be signed.

```
To make a signed constant: 10'sh37C
```

## Multiplication on the FPGA

Hardware multiplier block: two 18-bit twos complement (signed) operands



In the XC2V6000: 6 columns of mults, 24 in each column = 144 mults



### Sequential Multiplier

Assume the multiplicand (A) has N bits and the multiplier (B) has M bits. If we only want to invest in a single N-bit adder, we can build a sequential circuit that processes a single partial product at a time and then cycle the circuit M times:



### **Bit-Serial Multiplication**



```
Init: P = 0; Load A,B
Repeat M times {
   Repeat N times {
      shift A,P:
      Amsb = Alsb
      Pmsb = Plsb + Alsb*Blsb + C/0
    }
   shift P,B: Pmsb = C, Bmsb = Plsb
}
(N+M)-bit result in P/B
```

## Useful building block: Carry-Save Adder



## Wallace Tree Multiplier



building blocks.

O(log<sub>1.5</sub>M)

6.111 Fall 2008

Lecture 9

## Multiplication by a constant

- If one of the operands is a constant, make it the multiplier (B in the earlier examples). For each "1" bit in the constant we get a partial product (PP) may be noticeably fewer PPs than in the general case.
  - For example, in general multiplying two 4-bit operands generates four PPs (requiring 3 rows of full adders). If the multiplier is say, 12 (4'b1100), then there are only two PPs: 8\*A+4\*A (requiring only 1 row of full adders).
  - But lots of "1"s means lots of PPs... can we improve on this?
- If we allow ourselves to subtract PPs as well as adding them (the hardware cost is virtually the same), we can re-encode arbitrarily long contiguous runs of "1" bits in the multiplier to produce just two PPs.

...011110... = ...100000... - ...000010... = ...0100010...

where 1 indicates subtracting a PP instead of adding it. Thus we've re -encoded the multiplier using 1,0,-1 digits - aka *canonical signed digit* greatly reducing the number of additions required.

## Booth Recoding: Higher-radix mult.

Idea: If we could use, say, 2 bits of the multiplier in generating each partial product we would halve the number of rows and halve the latency of the multiplier!



## Booth recoding

On-the-fly canonical signed digit encoding!

current bit pair,

/from previous bit pair

| B <sub>K+1</sub> | B <sub>K</sub> | B <sub>K-1</sub> | action  |          |
|------------------|----------------|------------------|---------|----------|
| 0                | 0              | 0                | add O   |          |
| 0                | 0              | 1                | add A   |          |
| 0                | 1              | 0                | add A   |          |
| 0                | 1              | 1                | add 2*A |          |
| 1                | 0              | 0                | sub 2*A |          |
| 1                | 0              | 1                | sub A   | ← -2*A+A |
| 1                | 1              | 0                | sub A   |          |
| 1                | 1              | 1                | add O   | ← -A+A   |
|                  |                |                  | -       |          |

A "1" in this bit means the previous stage needed to add 4\*A. Since this stage is shifted by 2 bits with respect to the previous stage, adding 4\*A in the previous stage is like adding A in this stage!

#### Sequential Divider

Assume the Dividend (A) and the divisor (B) have N bits. If we only want to invest in a single N-bit adder, we can build a sequential circuit that processes a single subtraction each cycle and then cycle the circuit N times. This circuit works on unsigned operands; for signed operands one can remember the signs, make operands positive, then correct sign of result.



### Performance Metrics for Circuits

Circuit Latency (L):

time between arrival of new input and generation of corresponding output.

For combinational circuits this is just  $t_{PD}$ .

Circuit Throughput (T):

Rate at which new outputs appear.

For combinational circuits this is just  $1/t_{PD}$  or 1/L.

## Performance of Combinational Circuits



For combinational logic:  $L = t_{PD},$  $T = 1/t_{PD}.$ 

We can't get the answer faster, but are we making effective use of our hardware at all times?



F & G are "idle", just holding their outputs stable while H performs its computation

## Retiming: A very useful transform

Registers have to be moved from ALL inputs to ALL outputs or vice versa



Cutset retiming: A cutset intersects the edges, such that this would result in two disjoint partitions of the edges being cut. To retime, delays are moved from the ingoing to the outgoing edges or vice versa.



• Reduce total number of registers

Retiming Synchronous Circuitry Charles E. Leiserson and James B. Saxe

August 20, 1986

#### Retiming Combinational Circuits aka "Pipelining"



### Pipeline diagrams



The results associated with a particular set of input data moves *diagonally* through the diagram, progressing through one pipeline stage each clock cycle.

## **Pipeline Conventions**

DEFINITION:

a *K-Stage Pipeline* ("K-pipeline") is an acyclic circuit having exactly K registers on *every* path from an input to an output.

a COMBINATIONAL CIRCUIT is thus an O-stage pipeline.

CONVENTION:

Every pipeline stage, hence every K-Stage pipeline, has a register on its OUTPUT (not on its input).

ALWAYS:

The CLOCK common to all registers must have a period sufficient to cover propagation over combinational paths PLUS (input) register  $t_{PD}$  PLUS (output) register  $t_{SETUP}$ .

The LATENCY of a K-pipeline is K times the period of the clock common to all registers.

The THROUGHPUT of a K-pipeline is the frequency of the clock.

## **Ill-formed pipelines**

Consider a BAD job of pipelining:



For what value of K is the following circuit a K-Pipeline? \_\_\_\_\_\_

Problem:

Successive inputs get mixed: e.g.,  $B(A(X_{i+1}), Y_i)$ . This happened because some paths from inputs to outputs have 2 registers, and some have only 1!

This CAN'T HAPPEN on a well-formed K pipeline!

## A pipelining methodology

#### Step 1:

Add a register on each output.

#### Step 2:

Add another register on each output. Draw a cut-set contour that includes all the new registers and some part of the circuit. Retime by moving regs from all outputs to all inputs of cut-set.

Repeat until satisfied with T.

#### STRATEGY:

Focus your attention on placing pipelining registers around the slowest circuit elements (BOTTLENECKS).



## Pipeline Example



|         | LATENCY | THROUGHPUT |
|---------|---------|------------|
| 0-pipe: | 4       | 1/4        |
| 1-pipe: | 4       | 1/4        |
| 2-pipe: | 4       | 1/2        |
| 3-pipe: | 6       | 1/2        |

OBSERVATIONS:

- 1-pipeline improves neither L or T.
- T improved by breaking long combinational paths, allowing faster clock.
- Too many stages cost L, don't improve T.
- Back-to-back registers are often required to keep pipeline well -formed.

## Increasing Throughput: Pipelining



Throughput =  $1/4t_{PD,FA}$  instead of  $1/8t_{PD,FA}$ )

How about  $t_{PD} = 1/2t_{PD,FA}$ ?

