## Arithmetic Circuits

Didn't I learn how to do addition in the second grade? MIT courses aren't what they used to be...


## 01011 +00101 10000

Acknowledgements:

- R. Katz, "Contemporary Logic Design", Addison Wesley Publishing Company, Reading, MA, 1993. (Chapter 5)
- J. Rabaey, A. Chandrakasan, B. Nikolic, "Digital Integrated Circuits: A Design Perspective" Prentice Hall, 2003.
- Kevin Atkinson, Alice Wang, Rex Min


## Number Systems Basics

## How to represent negative numbers?

- Three common schemes: sign-magnitude, ones complement, twos complement
- Sign-magnitude: MSB = 0 for positive, 1 for negative
- Range: -( $\left.2^{\mathrm{N}-1}-1\right)$ to $+\left(2^{\mathrm{N}-1}-1\right)$
- Two representations for zero: 0000... \& 1000...
- Simple multiplication but complicated addition/subtraction
- Ones complement: if $\mathbf{N}$ is positive then its negative is $\overline{\mathbf{N}}$
- Example: $0111=7,1000=-7$
- Range: -( $\left.2^{\mathrm{N}-1}-1\right)$ to $+\left(2^{\mathrm{N}-1}-1\right)$
- Two representations for zero: 0000... \& 1111...
- Subtraction implemented as addition followed by ones complement


## 2's Complement



8-bit 2's complement example:

$$
11010110=-2^{7}+2^{6}+2^{4}+2^{2}+2^{1}=-128+64+16+4+2=-42
$$

If we use a two's-complement representation for signed integers, the same binary addition procedure will work for adding both signed and unsigned numbers.

By moving the implicit location of "decimal" point, we can represent fractions too:

$$
1101.0110=-2^{3}+2^{2}+2^{0}+2^{-2}+2^{-3}=-8+4+1+0.25+0.125=-2.25
$$

## Twos Complement Representation

## Twos complement $=$ bitwise complement +1

$0111 \rightarrow \mathbf{1 0 0 0}+1=1001=-7$
$1001 \rightarrow \mathbf{0 1 1 0}+\mathbf{1}=0111=7$

- Asymmetric range: $-\mathbf{2}^{\mathrm{N}-1}$ to $+\mathbf{2}^{\mathrm{N}-1}-1$
- Only one representation for zero
- Simple addition and subtraction
- Most common representation


| 4 | 0100 |  |  |  |  |  |  |
| ---: | :---: | :---: | :---: | :---: | :---: | ---: | :---: |
| +3 | $\frac{0011}{7}$ | +4 | 1100 | 4 | 0100 | -4 | 1100 |

## Binary Addition

Here's an example of binary addition as one might do it by "hand":


So we can quickly build a circuit two add two 4-bit numbers...


## Subtraction: $A-B=A+(-B)$

Using 2's complement representation: $-B=\sim B+1$

So let's build an arithmetic unit that does both addition and subtraction. Operation selected by control input:


## Condition Codes

Besides the sum, one often wants four other bits of information from an arithmetic unit:
Z (zero): result is $=0 \quad$ big NOR gate
N (negative): result is < $0 \quad S_{N-1}$
C (carry): indicates that add in the most
significant position produced a carry, e.g.,
"1 + (-1)" from last FA
$V$ (overflow): indicates that the answer has
too many bits to be represented correctly by
the result width, e.g., "(2 $\left.\overline{2}^{N-1}-1\right)+\left(2^{N-1}-1\right) "$
$V=A_{N-1} B_{N-1} \overline{S_{N-1}}+\overline{A_{N-1}}{ }^{B_{N-1}} S_{N-1}$
$V=C O U T_{N-1} \oplus C I N_{N-1}$

To compare $A$ and $B$, perform $A-B$ and use condition codes:

Signed comparison: LT $\quad \mathrm{N} \oplus \mathrm{V}$ LE $\quad \mathbf{Z}+(\mathbf{N} \oplus \mathbf{V})$ EQ Z NE ~Z GE $\quad \sim(N \oplus V)$ GT $\quad \sim(Z+(N \oplus V))$

Unsigned comparison:

| LTU | $C$ |
| :--- | :--- |
| LEU | $\mathrm{C}+\mathrm{Z}$ |
| GEU | $\sim \mathrm{C}$ |
| GTU | $\sim(\mathrm{C}+\mathrm{Z})$ |

## $t_{\text {PD }}$ of Ripple-carry Adder



Worse-case path: carry propagation from LSB to MSB, e.g., when adding 11... 111 to 00... 001 .

$$
t_{P D}=(N-1) * \underbrace{t_{P D, O R}+t_{P D, A N D}}_{C l \text { to } C O})+\underbrace{t_{P D, X O R}}_{C_{N-1} \text { to } S_{N-1}} \approx \Theta(N)
$$

$\Theta(N)$ is read "order $N$ " and tells us that the latency of our adder grows proportional to the number of bits in the operands.

## Faster carry logic

Let's see if we can improve the speed by rewriting the equations for $C_{\text {out }}$ :

$$
\begin{aligned}
C_{\text {OUT }} & =A B+A C_{\mathbb{I N}}+B C_{\mathbb{I N}} \\
& =A B+(A+B) C_{\mathbb{I N}} \\
& =G+P C_{\mathbb{I N}} \quad \text { where } G=A B \text { and } P=A+B \\
& \text { generate propagate }
\end{aligned}
$$

For adding two $N$-bit numbers:

$$
\begin{aligned}
C_{N} & =G_{\mathrm{N}-1}+P_{\mathrm{N}-1} C_{\mathrm{N}-1} \\
& =G_{\mathrm{N}-1}+P_{\mathrm{N}-1} G_{\mathrm{N}-2}+P_{\mathrm{N}-1} P_{\mathrm{N}-2} C_{\mathrm{N}-2} \\
& =G_{\mathrm{N}-1}+P_{\mathrm{N}-1} G_{\mathrm{N}-2}+P_{\mathrm{N}-1} P_{\mathrm{N}-2} G_{\mathrm{N}-3}+\ldots+P_{\mathrm{N}-1} \ldots P_{0} C_{I N}
\end{aligned}
$$

$$
C_{N} \text { in only } 3 \text { (!) gate delays: }
$$

1 for PIG generation, 1 for ANDs, 1 for final OR

## Carry Bypass Adder



Key Idea: if $\left(\mathbf{P}_{\mathbf{0}} \mathbf{P}_{\mathbf{1}} \mathbf{P}_{\mathbf{2}} \mathbf{P}_{\mathbf{3}}\right)$ then $\mathbf{C}_{\mathbf{0}, \mathbf{3}}=\mathbf{C}_{\mathbf{i}, \mathbf{0}}$

## 16-bit Carry Bypass Adder



Assume the following for delay each gate:
$P$, $G$ from $A, B: 1$ delay unit
$P, G, C_{i}$ to $C_{0}$ or Sum for a FA: 1 delay unit
2:1 mux delay: 1 delay unit

What is the worst case propagation delay for the 16-bit adder?

## Critical Path Analysis



For the second stage, is the critical path:
$B P 2=0$ or $B P 2=1$ ?
Message: Timing Analysis is Very Tricky Must Carefully Consider Data Dependencies For False Paths

## Carry-lookahead Adders (CLA)

We can choose the maximum fan-in we want for our logic gates and then build a hierarchical carry chain using these equations:

$$
\begin{aligned}
& C_{J+1}=G_{I J}+P_{\mathrm{IJ}} C_{I} \\
& G_{\mathrm{IK}}=G_{\mathrm{J}+1, \mathrm{~K}}+P_{\mathrm{J}+1, \mathrm{~K}} G_{\mathrm{IJ}} \\
& P_{\mathrm{IK}}=P_{\mathrm{IJ}} P_{\mathrm{J}+1, \mathrm{~K}}
\end{aligned}
$$

where $I \leqslant J$ and $J+1 \leqslant K$
"generate a carry from bits I thru $K$ if it is generated in the high-order ( $J+1, K$ ) part of the block or if it is generated in the low-order ( $I, J$ ) part of the block and then propagated thru the high part"


Hierarchical building block


## 8-bit CLA (P/G generation)



## 8-bit CLA (carry generation)



## 8-bit CLA (complete)



## Unsigned Multiplication

$$
\begin{aligned}
& \begin{array}{rlll}
A_{3} & A_{2} & A_{1} & A_{0} \\
\times B_{3} & B_{2} & B_{1} & B_{0} \\
\hline
\end{array} \\
& A B_{i} \text { called a "partial product" } \longrightarrow A_{3} B_{0} A_{2} B_{0} A_{1} B_{0} A_{0} B_{0} \\
& A_{3} B_{1} \quad A_{2} B_{1} \quad A_{1} B_{1} \quad A_{0} B_{1} \\
& A_{3} B_{2} \quad A_{2} B_{2} \quad A_{1} B_{2} \quad A_{0} B_{2} \\
& +A_{3} B_{3} A_{2} B_{3} A_{1} B_{3} A_{0} B_{3}
\end{aligned}
$$

Multiplying $N$-bit number by $M$-bit number gives $(N+M)$-bit result
Easy part: forming partial products
(just an AND gate since $B_{I}$ is either 0 or 1)
Hard part: adding M N-bit partial products

## Sequential Multiplier

Assume the multiplicand ( $A$ ) has $N$ bits and the multiplier (B) has $M$ bits. If we only want to invest in a single $N$-bit adder, we can build a sequential circuit that processes a single partial product at a time and then cycle the circuit $M$ times:


```
Init: P\leftarrow0, load A and B
Repeat M times {
    P}\leftarrowP+(\mp@subsup{B}{LSB}{\prime==1 ? A : 0)
    shift P/B right one bit
}
Done: (N+M)-bit result in P/B
```


## Combinational Multiplier



## 2's Complement Multiplication

Step 1: two's complement operands so high order bit is $-2^{\mathrm{N}-1}$. Must sign extend partial products and subtract the last one


Step 2: don't want all those extra additions, so add a carefully chosen constant, remembering to subtract it at the end. Convert subtraction in add of (complement + 1).

```
    X3Y0 X3Y0 X3Y0 X3Y0 X3Y0 X2Y0 X1Y0 X0Y0
+
+ X3Y1 X3Y1 X3Y1 X3Y1 X2Y1 X1Y1 X0Y1
+ 1
+ X3Y2 X3Y2 X3Y2 X2Y2 X1Y2 X0Y2
+
+ \overline{X3Y3 }\overline{\textrm{X}3\textrm{Y}3}\overline{\textrm{X}2\textrm{Y}3}\overline{\textrm{X}1\textrm{Y}3}\overline{\textrm{X}0Y3}
+ 1
- 1 1 1 1 1
```

Step 3: add the ones to the partial products and propagate the carries. All the sign extension bits go away!

|  |  |  |  | $\overline{\mathrm{X} 3 \mathrm{Y} 0}$ | X2Y0 | X1Y0 | X0Y0 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| + |  |  | $\overline{\mathrm{X} 3 \mathrm{Y} 1}$ | X2Y1 | X1Y1 | X0Y1 |  |
| + |  | $\overline{\mathrm{X2Y2}}$ | X1Y2 | X0Y2 |  |  |  |
| + | X3Y3 | $\overline{\mathrm{X2Y}}$ | $\overline{\mathrm{X1Y}}$ | $\overline{\mathrm{X0Y3}}$ |  |  |  |
| + |  |  |  | 1 |  |  |  |
| - | 1 | 1 | 1 | 1 |  |  |  |

Step 4: finish computing the constants...


Result: multiplying 2's complement operands takes just about same amount of hardware as multiplying unsigned operands!

## 2's Complement Multiplication



## Carry-Save Adder (CSA)

Good for pipelining: delay through each partial product (except the last) is just tPD, AND + +PD,FA. No


## Latency Improvements

Abstract partial product picture :


Rewire so that first two adders work in parallel. Feed results into third and fourth adders which also work in parallel, etc.


Even and odd streams pass through half the adders so even/odd design runs at almost twice the speed of simple implementation.

## More Latency Improvements



## Higher-radix multiplication

Idea: If we could use, say, 2 bits of the multiplier in generating each partial product we would halve the number of columns and halve the latency of the multiplier!

| $A_{N-1}$ | $A_{N-2}$ | $\ldots$ | $A_{4}$ | $A_{3}$ | $A_{2}$ | $A_{1}$ | $A_{0}$ |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| $\times$ | $B_{M-1}$ | $B_{M-2}$ | $\cdots$ | $B_{3}$ | $B_{2}$ | $B_{1}$ | $B_{0}$ |

## Booth recoding

current bit pair

| from previous bit pair |  |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
| $B_{K+1}$ | $B_{K}$ | $B_{K-1}$ | action |  |
| 0 | 0 | 0 | add 0 |  |
| 0 | 0 | 1 | add $A$ |  |
| 0 | 1 | 0 | add $A$ |  |
| 0 | 1 | 1 | add $2^{\star} A$ |  |
| 1 | 0 | 0 | sub $2^{\star} A$ |  |
| 1 | 0 | 1 | sub $A$ | $\leftarrow-2^{\star} A+A$ |
| 1 | 1 | 0 | sub $A$ | $\leftarrow-A+A$ |
| 1 | 1 | 1 | add 0 |  |

A "1" in this bit means the previous stage needed to add $4^{*} A$. Since this stage is shifted by 2 bits with respect to the previous stage, adding 4* $A$ in the previous stage is like adding $A$ in this stage!

## Behavioral Transformations

- There are a large number of implementations of the same functionality
- These implementations present a different point in the area-time-power design space
- Behavioral transformations allow exploring the design space a high-level


## Optimization metrics:

1. Area of the design
2. Throughput or sample time $T_{S}$
3. Latency: clock cycles between the input and associated output change
4. Power consumption
5. Energy of executing a task

6. ...

## Fixed-Coefficient Multiplication

Conventional Multiplication

$$
\begin{aligned}
& \mathbf{Z}=\mathbf{X} \cdot \mathbf{Y} \\
& \mathrm{X}_{3} \cdot \mathrm{Y}_{1} \quad \mathrm{X}_{2} \cdot \mathrm{Y}_{1} \quad \mathrm{X}_{1} \cdot \mathrm{Y}_{1} \quad \mathrm{X}_{0} \cdot \mathrm{Y}_{1} \\
& \mathrm{X}_{3} \cdot \mathrm{Y}_{2} \quad \mathrm{X}_{2} \cdot \mathrm{Y}_{2} \quad \mathrm{X}_{1} \cdot \mathrm{Y}_{2} \quad \mathrm{X}_{0} \cdot \mathrm{Y}_{2} \\
&
\end{aligned}
$$

Constant multiplication (become hardwired shifts and adds)


## Transform: Canonical Signed Digits (CSD)

Canonical signed digit representation is used to increase the number of zeros. It uses digits $\{-1,0,1\}$ instead of only $\{0,1\}$.

Iterative encoding: replace string of consecutive 1's

$$
\left.\begin{array}{|llllll|l|l|l|l|llllll|}
\hline 0 & 1 & 1 & \ldots & 1 & 1
\end{array}\right] \begin{array}{|lllllll}
\hline 1 & 0 & 0 & \ldots & 0 & - \\
\hline
\end{array}
$$

Worst case CSD has 50\% non zero bits


## Algebraic Transformations

$\xrightarrow[A+B=B+A]{\text { Associativity }}$

Distributivity

$(A+B) C=A B+B C$
Common sub-expressions


## Transforms for Efficient Resource Utilization


distributivity

Time multiplexing: mapped to 3 multipliers and 3 adders


Reduce number of operators to 2 multipliers and 2 adders


Lectures 9/10, Slide 31

## Retiming: A very useful transform

## Retiming is the action of moving delay around in the systems

- Delays have to be moved from ALL inputs to ALL outputs or vice versa


Cutset retiming: A cutset intersects the edges, such that this would result in two disjoint partitions of these edges being cut. To retime, delays are moved from the ingoing to the outgoing edges or vice versa.

Benefits of retiming:

- Modify critical path delay
- Reduce total number of registers


Lectures 9/10, Slide 32

## Retiming Example: FIR Filter



Note: here we use a first cut analysis that assumes the delay of a chain of operators is the sum of their individual delays. This is not accurate.

## Pipelining $=$ Adding Registers + Retiming



## The Power of Transforms: Lookahead



