## Lecture 14 GPUS

### DANIEL SANCHEZ AND JOEL EMER

[INCORPORATES MATERIAL FROM KOZYRAKIS (EE382A), NVIDIA KEPLER WHITEPAPER, HENNESY&PATTERSON]

6.888 PARALLEL AND HETEROGENEOUS COMPUTER ARCHITECTURE SPRING 2013



Massachusetts Institute of Technology

## Today's Menu

- Review of vector processors
- Basic GPU architecture
- Paper discussions

### **Vector Processors**



Scalar processors operate on single numbers (scalars)

 Vector processors operate on linear sequences of numbers (vectors)

### What's in a Vector Processor?

- □ A scalar processor (e.g. a MIPS processor)
  - Scalar register file (32 registers)
  - Scalar functional units (arithmetic, load/store, etc)
- A vector register file (a 2D register array)
  - Each register is an array of elements
    - E.g. 32 registers with 32 64-bit elements per register
  - MVL = maximum vector length = max # of elements per register
- A set of vector functional units
  - Integer, FP, load/store, etc
  - Some times vector and scalar units are combined (share ALUs)

## Example of Simple Vector Processor



### **Basic Vector ISA**

| <u>Instr</u> .  | <u>Operands</u>   | <u>Operation</u>                        | <u>Comment</u>     |
|-----------------|-------------------|-----------------------------------------|--------------------|
| vadd. <b>vv</b> | V1,V2,V3          | V1=V2+V3                                | vector + vector    |
| VADD. <b>SV</b> | V1, <b>R0,</b> V2 | V1= <b>RO</b> +V2                       | scalar + vector    |
| VMUL.VV         | V1,V2,V3          | V1=V2*V3                                | vector x vector    |
| VMUL.SV         | V1,R0,V2          | V1=R0*V2                                | scalar x vector    |
| VLD             | V1,R1             | V1=M[R1R1+63]                           | load, stride=1     |
| VLD <b>S</b>    | V1,R1, <b>R2</b>  | V1=M[R1R1+63*R2]                        | load, stride=R2    |
| VLD <b>X</b>    | V1,R1, <b>V2</b>  | V1=M[R1 <b>+V2</b> <sub>i</sub> ,i=063] | indexed("gather")  |
| VST             | V1,R1             | M[R1R1+63]=V1                           | store, stride=1    |
| VST <b>S</b>    | V1,R1, <b>R2</b>  | V1=M[R1R1 <b>+63*R2</b> ]               | store, stride=R2   |
| VSTX            | V1,R1, <b>V2</b>  | V1=M[R1 <b>+V2</b> <sub>i</sub> ,i=063] | indexed("scatter") |

+ regular scalar instructions...

### Advantages of Vector ISAs

Compact: single instruction defines N operations

- Amortizes the cost of instruction fetch/decode/issue
- Also reduces the frequency of branches
- Parallel: N operations are (data) parallel
  - No dependencies
  - No need for complex hardware to detect parallelism (similar to VLIW)
  - Can execute in parallel assuming N parallel datapaths
- Expressive: memory operations describe patterns
  - Continuous or regular memory access pattern
  - Can prefetch or accelerate using wide/multi-banked memory
  - Can amortize high latency for 1st element over large sequential pattern

# Vector Length (VL)

- Basic: Fixed vector length (typical in narrow SIMD)
   Is this efficient for wide SIMD (e.g., 32-wide vectors)?
- Vector-length (VL) register: Control the length of any vector operation, including vector loads and stores
  - e.g. vadd.vv with VL=10  $\leftarrow \rightarrow$  for (i=0; i<10; i++) V1[i]=V2[i]+V3[i]
  - VL can be set up to MVL (e.g., 32)
  - How to do vectors > MVL?
  - What if VL is unknown at compile time?

## **Optimization 1: Chaining**

- □ Suppose the following code with VL=32:
  - vmul.vv V1,V2,V3
  - vadd.vv V4,V1,V5 # very long RAW hazard
- Chaining
  - V1 is not a single entity but a group of individual elements
  - Pipeline forwarding can work on an element basis
- Flexible chaining: allow vector to chain to any other active vector operation => more read/write ports



### **Optimization 2: Multiple Lanes**



### Modular, scalable design

- Elements for each vector register interleaved across the lanes
- Each lane receives identical control
- Multiple element operations executed per cycle
- No need for inter-lane communication for most vector instructions

## Chaining & Multi-Iane Example



## **Optimization 3: Conditional Execution**

Suppose you want to vectorize this: for (i=0; i<N; i++) if (A[i]!= B[i]) A[i] -= B[i];</p> 12

- □ Solution: Vector conditional execution (predication)
  - Add vector flag registers with single-bit elements (masks)
  - Use a vector compare to set the a flag register
  - Use flag register as mask control for the vector sub
    - Add executed only for vector elements with corresponding flag element set

Vector code

| vld         | V1, Ra                            |
|-------------|-----------------------------------|
| vld         | V2, Rb                            |
| vcmp.neq.vv | M0, V1, V2 # vector compare       |
| vsub.vv     | V3, V2, V1, M0 # conditional vadd |
| vst         | V3, Ra                            |

### Example: Intel Xeon Phi (Knights Corner)



□ A multi-core chip with x86-based vector processors

- Ring interconnect, private L2 caches, coherent
- Targeting the HPC market
  - Goal: high GFLOPS, GFLOPS/Watt

6.888 Spring 2013 - Sanchez and Emer - L14

## Xeon Phi Core Design



- □ 4-way threaded + vector processing
- □ In-order (why?), short pipeline
- Vector ISA: 32 vector registers (512b), 8 mask registers, scatter/gather

## **Graphics Processors Timeline**

### □ Till mid 90s

- VGA controllers used to accelerate some display functions
- Mid 90s to mid 00s
  - Fixed-function graphics accelerators for the OpenGL and DirectX APIs
    - Some GP-GPU capabilities by on top of the interfaces
  - 3D graphics: triangle setup & rasterization, texture mapping & shading
- Modern GPUs
  - Programmable multiprocessors optimized for data-parallel ops
    - OpenGL/DirectX and general purpose languages (CUDA, OpenCL, ...)
  - Some fixed-function hardware (texture, raster ops, ...)
  - Either a PCIe accelerator (discrete), or in same die as CPU (integrated)
    - Tradeoffs?



### □ GPU hardware architecture

- □ Good high-level mental model
  - GPU = Multicore chip, with highly-threaded vector cores
  - Not 100% accurate, but helpful as a SW developer

### Refresh: Software GPU Thread Model (CUDA)





Thr ead



 Single-program multiple data (SPMD) model 17

- Each thread has local memory
- Parallel threads packed in blocks
  - Access to per-block shared memory
  - Can synchronize with barrier
- Grids include independent groups
   Many execute consummently
  - May execute concurrently

# Code Example: SAXPY

#### C Code

#### CUDA Code

```
// Invoke DAXPY with 256 threads per block
__host__
int nblocks = (n+ 255) / 256;
   daxpy<<<nblocks, 256>>>(n, 2.0, x, y);
// DAXPY in CUDA
__device__
void daxpy(int n, double a, double *x, double *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}</pre>
```

CUDA code launches 256 threads per block

- Thread = 1 iteration of scalar loop (1 element in vector loop)
- Block = body of vectorized loop (with VL=256 in this example)
- Grid = vectorizable loop

### Example: Nvidia Kepler GK110



- 15 SMX processors, shared L2, 6 memory controllers
  - 1TFLOP DP
- HW thread scheduling

6.888 Spring 2013 - Sanchez and Emer - L14

## Streaming Multiprocessor (SMX)

| SMX                                                                           | Ĺ                                                      |      |         |                   |      |      |         |       | 4                 |      |      |      |         |                   |      |      |         |       |     |
|-------------------------------------------------------------------------------|--------------------------------------------------------|------|---------|-------------------|------|------|---------|-------|-------------------|------|------|------|---------|-------------------|------|------|---------|-------|-----|
| Instruction Cache Warp Scheduler Warp Scheduler Warp Scheduler Warp Scheduler |                                                        |      |         |                   |      |      |         |       |                   |      |      |      |         |                   |      |      |         |       |     |
| Dispatch Dispatch                                                             |                                                        |      |         | Dispatch Dispatch |      |      |         |       | Dispatch Dispatch |      |      |      |         | Dispatch Dispatch |      |      |         |       |     |
|                                                                               |                                                        |      |         |                   |      |      |         |       |                   |      |      |      |         |                   |      |      |         |       |     |
| L                                                                             | Register File (65,536 x 32-bit)                        |      |         |                   |      |      |         |       |                   |      |      |      |         |                   |      |      |         |       |     |
|                                                                               |                                                        |      | DP Unit |                   | Core |      | DP Unit | LD/ST | SFU               | Core | Core |      |         |                   |      |      | DP Unit |       | SFU |
| Core                                                                          | Core                                                   | Core | DP Unit | Core              | Core | Core | DP Unit | LD/ST | SFU               | Core | Core | Core | DP Unit | Core              | Core | Core | DP Unit | LD/ST | SFU |
| Core                                                                          | Core                                                   | Core | DP Unit | Core              | Core | Core | DP Unit | LD/ST | SFU               | Core | Core | Core | DP Unit | Core              | Core | Core | DP Unit | LD/ST | SFU |
| Core                                                                          | Core                                                   | Core | DP Unit | Core              | Core | Core | DP Unit | LD/ST | SFU               | Core | Core | Core | DP Unit | Core              | Core | Core | DP Unit | LD/ST | SFU |
| Core                                                                          | Core                                                   | Core | DP Unit | Core              | Core | Core | DP Unit | LD/ST | SFU               | Core | Core | Core | DP Unit | Core              | Core | Core | DP Unit | LD/ST | SFU |
| Core                                                                          | Core                                                   | Core | DP Unit | Core              | Core | Core | DP Unit | LD/ST | SFU               | Core | Core | Core | DP Unit | Core              | Core | Core | DP Unit | LD/ST | SFU |
| Core                                                                          | Core                                                   | Core | DP Unit | Core              | Core | Core | DP Unit | LD/ST | SFU               | Core | Core | Core | DP Unit | Core              | Core | Core | DP Unit | LD/ST | SFU |
| Core                                                                          | Core                                                   | Core | DP Unit | Core              | Core | Core | DP Unit | LD/ST | SFU               | Core | Core | Core | DP Unit | Core              | Core | Core | DP Unit | LD/ST | SFU |
| Core                                                                          | Core                                                   | Core | DP Unit | Core              | Core | Core | DP Unit | LD/ST | SFU               | Core | Core | Core | DP Unit | Core              | Core | Core | DP Unit | LD/ST | SFU |
| Core                                                                          | Core                                                   | Core | DP Unit | Core              | Core | Core | DP Unit | LD/ST | SFU               | Core | Core | Core | DP Unit | Core              | Core | Core | DP Unit | LD/ST | SFU |
| Core                                                                          | Core                                                   | Core | DP Unit | Core              | Core | Core | DP Unit | LD/ST | SFU               | Core | Core | Core | DP Unit | Core              | Core | Core | DP Unit | LD/ST | SFU |
| Core                                                                          | Core                                                   | Core | DP Unit | Core              | Core | Core | DP Unit | LD/ST | SFU               | Core | Core | Core | DP Unit | Core              | Core | Core | DP Unit | LD/ST | SFU |
| Core                                                                          | Core                                                   | Core | DP Unit | Core              | Core | Core | DP Unit | LD/ST | SFU               | Core | Core | Core | DP Unit | Core              | Core | Core | DP Unit | LD/ST | SFU |
| Core                                                                          | Core                                                   | Core | DP Unit | Core              | Core | Core | DP Unit | LD/ST | SFU               | Core | Core | Core | DP Unit | Core              | Core | Core | DP Unit | LD/ST | SFU |
| Core                                                                          | Core                                                   | Core | DP Unit | Core              | Core | Core | DP Unit | LD/ST | SFU               | Core | Core | Core | DP Unit | Core              | Core | Core | DP Unit | LD/ST | SFU |
| Core                                                                          | Core                                                   | Core | DP Unit | Core              | Core | Core | DP Unit | LD/ST | SFU               | Core | Core | Core | DP Unit | Core              | Core | Core | DP Unit | LD/ST | SFU |
|                                                                               | Interconnect Network<br>64 KB Shared Memory / L1 Cache |      |         |                   |      |      |         |       |                   |      |      |      |         |                   |      |      |         |       |     |
|                                                                               |                                                        |      |         |                   |      |      |         | B Rea |                   |      |      |      |         |                   |      |      |         |       |     |
|                                                                               | Tex                                                    |      | Tex     |                   |      | Tex  |         | Tex   | (                 |      | Tex  |      | Тех     | (                 |      | Tex  |         | Tex   |     |
|                                                                               | Tex                                                    |      | Tex     |                   |      | Tex  |         | Tex   | (                 |      | Tex  |      | Tex     |                   | Tex  |      |         | Tex   |     |

- Cores are
  - Multithreded
  - Data parallel
- Capabilities
  - 64K registers
  - 192 simple cores
    - Int and SP FPU
  - 64 DP FPUs
  - 32 LSUs, 32 SFUs
- Scheduling
  - 4 warp schedulers
  - 2 inst dispatch per warp

6.888 Spring 2013 - Sanchez and Emer - L14



- □ In theory, all threads can be independent
  - HW implements zero-overhead switching
  - For efficiency, 32 threads are packed in warps
    - Warp: set of parallel threads the execute same instruction
      - Wrap = a thread of vector instructions
      - Warps introduce data parallelism
    - 1 warp instruction keeps cores busy for multiple cycles
- Individual threads may be inactive
  - Because they branched differently
  - This is the equivalent of conditional execution (but implicit)
  - Loss of efficiency if not data parallel
- SW thread blocks mapped to warps
  - When HW resources are available

# Warp Scheduling



- 64 warps per SMX
- 32 threads per warp
  - 64K registers/SMX
  - Up to 255 registers per thread (8 warps)

#### Scheduling

- 4 schedulers select 1 warp per cycle
- 2 independent instructions issued per warp (double-pumped FUs)
- Total bandwidth = 4 \* 2 \* 32 = 256 ops per cycle
- Register scoreboarding
  - To track ready instructions
  - Simplified using static latencies
    - Binary incompatibility?

## Hardware Scheduling



- HW unit schedules grids on SMX
  - Priority based scheduling
- 32 active grids
  - More queued/paused
- Grids launched by CPU or GPU
  - Work from multiple CPU cores

## Memory Hierarchy



- Each SMX has 64KB of memory
  - Split between shared mem and L1 cache
    - 16/48, 32/32, 48/16
  - 256B per access
- 48KB read-only data cache
- 1.5MB shared L2
  - Supports synchronization operations (atomicCAS, atomicADD, ...)
- Throughput-oriented main memory
  - Memory scheduling? TCM-like?
  - GDDRx standards

### **Paper Discussions**

- DWF, Fung et al., MICRO'07
- RF/WS, Gebhart et al., ISCA'11

### Lost in Translation: Vector vs GPU



6.888 Spring 2013 - Sanchez and Emer - L14

### Lost in Translation: Vector vs GPU

| Туре                 | More descrip-<br>tive name             | Closest old term<br>outside of GPUs           | Official CUDA/<br>NVIDIA GPU term | Book definition                                                                                                                                                             |                               |
|----------------------|----------------------------------------|-----------------------------------------------|-----------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------|
| suo                  | Vectorizable<br>Loop                   | Vectorizable Loop                             | Grid                              | A vectorizable loop, executed on the GPU, made<br>up of one or more Thread Blocks (bodies of<br>vectorized loop) that can execute in parallel.                              |                               |
| Program abstractions | Body of<br>Vectorized Loop             | Body of a<br>(Strip-Mined)<br>Vectorized Loop | Thread Block                      | A vectorized loop executed on a multithreaded<br>SIMD Processor, made up of one or more threads<br>of SIMD instructions. They can communicate via<br>Local Memory.          |                               |
| Progra               | Sequence of<br>SIMD Lane<br>Operations | One iteration of<br>a Scalar Loop             | CUDA Thread                       | A vertical cut of a thread of SIMD instructions<br>corresponding to one element executed by one<br>SIMD Lane. Result is stored depending on mask<br>and predicate register. | From Computer                 |
| Machine object       | A Thread of<br>SIMD<br>Instructions    | Thread of Vector<br>Instructions              | Warp                              | A traditional thread, but it contains just SIMD<br>instructions that are executed on a multithreaded<br>SIMD Processor. Results stored depending on a<br>per-element mask.  | Architecture, 4 <sup>th</sup> |
| Mach                 | SIMD<br>Instruction                    | Vector Instruction                            | PTX Instruction                   | A single SIMD instruction executed across SIMD<br>Lanes.                                                                                                                    | edition by J.                 |
| Processing hardware  | Multithreaded<br>SIMD<br>Processor     | (Multithreaded)<br>Vector Processor           | Streaming<br>Multiprocessor       | A multithreaded SIMD Processor executes<br>threads of SIMD instructions, independent of<br>other SIMD Processors.                                                           | Hennessy and D                |
|                      | Thread Block<br>Scheduler              | Scalar Processor                              | Giga Thread<br>Engine             | Assigns multiple Thread Blocks (bodies of vectorized loop) to multithreaded SIMD Processors.                                                                                | Patterson                     |
|                      | SIMD Thread<br>Scheduler               | Thread scheduler<br>in a Multithreaded<br>CPU | Warp Scheduler                    | Hardware unit that schedules and issues threads<br>of SIMD instructions when they are ready to<br>execute; includes a scoreboard to track SIMD<br>Thread execution.         |                               |
|                      | SIMD Lane                              | Vector Lane                                   | Thread Processor                  | A SIMD Lane executes the operations in a thread<br>of SIMD instructions on a single element. Results<br>stored depending on mask.                                           |                               |
| Memory hardware      | GPU Memory                             | Main Memory                                   | Global Memory                     | DRAM memory accessible by all multithreaded<br>SIMD Processors in a GPU.                                                                                                    |                               |
|                      | Private<br>Memory                      | Stack or Thread<br>Local Storage (OS)         | Local Memory                      | Portion of DRAM memory private to each SIMD Lane.                                                                                                                           |                               |
|                      | Local Memory                           | Local Memory                                  | Shared Memory                     | Fast local SRAM for one multithreaded SIMD<br>Processor, unavailable to other SIMD Processors.                                                                              |                               |
|                      | SIMD Lane<br>Registers                 | Vector Lane<br>Registers                      | Thread Processor<br>Registers     | Registers in a single SIMD Lane allocated across<br>a full thread block (body of vectorized loop).<br>5.888 Spring 2013 - Sanchez and Emer - L14                            |                               |

27

D.